
DOCUMENT PROCESSOR, DOCUMENT CLASSIFICATION DEVICE, 
DOCUMENT PROCESSING METHOD, DOCUMENT CLASSIFICATION METHOD, 
AND COMPUTER-READABLE RECORDING MEDIUM 
FOR RECORDING PROGRAMS FOR 
EXECUTING THE METHODS ON A COMPUTER



FIELD OF THE INVENTION 

The present invention relates to a document processor for
displaying and printing multiple input document data in a
predetermined format, a document processing method, and a
computer-readable recording medium for recording a program to
execute the method on a computer. Furthermore, this invention
relates to a document classification device and a document
classification method for classifying multiple input document
data based on the contents thereof, and particularly for
refining classification categories calculated during document
classification, and to a computer-readable recording medium for
recording a program to execute the method on a computer.

BACKGROUND OF THE INVENTION

Various document classification devices and document
retrieval devices have been developed in recent years. The
proliferation of network technology, such as the Internet, has
made it possible to access a huge amount of electronic documents,
domestically and overseas, and there has been a proportionate
rapid expansion in the amount of data which is stored
electronically. Accordingly, there is an increasing need for
intellectual operations such as classifying large collections
of document data into meaningful categories.

The benefits of classifying large amounts of document
data according to their meaning are as follows. Firstly, it
makes it easier to retrieve data. Retrieval becomes relatively
easy since vast groups of documents can be retrieved using
category names as clues.

Secondly, entire groups of data can be grasped. That is,
it is possible to grasp the contents (individual
classifications) of an entire cluster of documents. However,
when a large amount of document data is classified by an operator,
although accurate classification can be achieved,
classification requires enormous manpower and time.
Consequently, in view of the huge amount of documents stored
in recent years, devices for automatically classifying document
data have been proposed.

As an example of a conventional device for automatically
classifying documents, Japanese Patent Application Laid-open
(JP-A) No. 7-36897 discloses a device which defines a document
as a document vector characterized by a word, uses clustering
to group these document vectors, and automatically classifies
the documents based on the grouped document vectors.

Furthermore, in "Projections for Efficient Document
Clustering" (Hinrich Schütze and Craig Silverstein,
Proceedings of ACM SIGIR, pages 78-81, 1997), documents are
classified in a latent semantic space. Other conceivable
methods include using a probability theory approach, etc.

Furthermore, in recent years, the proliferation of the
Internet and the like has made it possible to access large
amounts of document clusters, and as a result, there is an
increasing need to be able to use these document clusters
effectively, and in accordance with the intentions of a variety
of users. To accomplish this, an intellectual operation is
starting to be used in which a large amount of document clusters
is classified into meaningful categories, and the structure of
the document clusters is grasped. However, when this type of
classification is performed manually, enormous manpower and
time are required. Further, since only the classifier knows
how to classify the document data, classification standards
change when the person responsible for classification is
replaced.

Consequently, there is a demand for a document
classification device capable of automatically classifying
groups of documents according to the same type of classification
standards used by humans. For example, Japanese Patent
Application Laid-open (JP-A) No. 7-114572 discloses a document
classification device capable of automatically extracting a
word characteristic vector from a document, and classifying the
document based on the characteristic vector, thereby making it
possible to automatically classify the documents using
meaningful differences.

However, since the conventional document classification
device described above uses a method for statistically
classifying documents arranged in multi-dimensional space
essentially comprising words, the result of the classification
is nothing more than the statistically determined behaviour of
the words. Consequently, clusters (partial groups of
individual classified documents) calculated after
classification are sometimes incomprehensible to the operator
(user).

A further problem is that the question of what kind of
classification is appropriate depends on the characteristics
of the document clusters to be classified and the intentions
of the user, making it difficult to define an appropriate
classification. In particular, when grasping entire data
groups as mentioned above, the type of classification required
will differ depending on the widely varying intentions of the
operators, and it will be difficult to obtain the result desired
by the operator in a single classification.

Thus, the problem can be interpreted by saying that a
document classification result includes a great amount of noise,
only one part of which is of use to the operator.

Furthermore, the conventional technology does not
consider the constitutional units of the document, and in a case
where the structure of a document is partitioned by one or
multiple period symbols, titles, and the like, multiple topics
and meanings are contained in a single document. This results
in problems that it is difficult for a user to understand the
classification categories, the category may be limited to a
specific topic or specific meaning, or the document may be
classified under a category different to that intended by the
user.

A context-dependent automatic classification device is
disclosed in Japanese Patent Application Laid-open (JP-A) No.
6-176064, and aims to increase classification precision by
automatically classifying documents in consideration of the
conclusive data therein, but essentially does not solve the
problems mentioned above.

Furthermore, conventional document processors, such as
the document classification device and document retrieval device
described above, merely classify or retrieve documents, and
give no consideration to further analysis of information hidden
in the document clusters. Consequently, they have a
disadvantage that a separate analyzing device must be used to
analyze information hidden in the document clusters.

Furthermore, the operator who wishes to analyze the
information does not perform classification and retrieval as
an end in itself, but simply as an intermediate step during his
analysis of the information. After classification and
retrieval, in order to grasp the result more easily it is usually
necessary to derive a meaningful result from the information
analysis by repeating a variety of other processes, such as
maximizing the practical usefulness of the information included
in the original document, rearranging the result, carrying out
totalization and statistical processing, and drawing up charts
and graphs based on the results.

Furthermore, table-calculating (spreadsheet) software is
sometimes needed when analyzing information about numerical data.
However, table-calculating software was originally developed
to handle numerical data, and is not sufficiently effective for
analyzing textual data, particularly when the analysis concerns
the meaning of documents.

SUMMARY OF THE INVENTION 

This invention has been achieved in order to solve the
problems of the conventional examples described above. It is
a first object of the present invention to provide a document
processor, a document processing method, and a computer-readable
recording medium storing programs for executing the
method on a computer, for carrying out analysis concerning the
meaning of documents, not simply by outputting the results of
fixed functions such as classification and retrieval, but by
supporting a complete range of information analysis.

To solve the problems of the conventional example 
described above, it is a second object of the present invention 
to provide a document classification device and a document 
classification method capable of instantly determining what
type of contents are contained in a given document cluster, and 
a computer-readable recording medium for storing programs for 
executing the method on a computer. 

Furthermore, to solve the problems of the conventional 
example described above, it is a third object of the present 
invention to provide a document classification device and a 
document classification method wherein, when one document 
contains multiple topics and meanings, these can be classified 
into categories according to specific topics and meanings, so 
that the classifications do not differ from categories desired 
by a user, thereby enabling the user to easily comprehend the 
classification categories, and a computer-readable recording 
medium for storing programs for executing the method on a 
computer.

In order to solve the problems mentioned above, the 
document processor according to one aspect of the present 
invention for displaying and printing in a predetermined format 
multiple input document data, comprises a document memory unit 
for storing input document data; a selection unit for selecting 
all or part of the document data stored in the document memory unit;
a characteristics extraction unit for extracting data relating 
to characteristics of letter rows from all or part of the 
document data selected by the selection unit; a work processing 
unit for work-processing all or part of the document data based 
on the data relating to characteristics of letter rows extracted 
by the characteristics extraction unit; and an output unit for 
outputting all or part of the document data work-processed by 
the work processing unit. 

According to the above aspect of this invention, when 
analyzing documents according to their meanings, rather than 
merely outputting the result of the analysis, the entire 
information analysis operation can be supported. 

Further, the output unit of the document processor 
comprises an item value set unit for setting a plurality of item 
values based on the contents of all or part of the document data 
work-processed by the work-processing unit; and a totalization 
unit for totalizing all or part of the document data for each 
item value set by the item value set unit. Furthermore, the 
output unit outputs all or part of the document data in the format 
of a table having an item value as at least one axis. 
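
The item value setting and totalization described above can be sketched as follows. This is a minimal illustration, not the claimed implementation: the record fields `category` and `month`, the sample data, and the function name `cross_tabulate` are all hypothetical stand-ins for item values derived from work-processed document data.

```python
from collections import Counter

def cross_tabulate(records, row_item, col_item):
    """Totalize documents for each (row item value, column item value) pair."""
    counts = Counter((r[row_item], r[col_item]) for r in records)
    rows = sorted({r[row_item] for r in records})
    cols = sorted({r[col_item] for r in records})
    # Counter returns 0 for pairs that never occur, so every cell is filled.
    return {row: {col: counts[(row, col)] for col in cols} for row in rows}

# Hypothetical work-processed documents, each tagged with two item values.
records = [
    {"category": "printer", "month": "Jan"},
    {"category": "printer", "month": "Feb"},
    {"category": "copier", "month": "Jan"},
    {"category": "printer", "month": "Jan"},
]

table = cross_tabulate(records, "category", "month")
for row, cells in table.items():
    print(row, cells)
```

Each axis of the resulting table corresponds to one item value, so a table with a single item axis is obtained by totalizing over one field only.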

Hence the result of the work-processing can easily be
expressed in a cross table, and the contents of the information
can easily be grasped. Therefore, when analyzing documents
according to their meanings, rather than merely outputting the
result of the analysis, the entire information analysis
operation can be supported.

Further, the output unit outputs all or part of the 
document data work-processed by the work processing unit 
together with all or part of the document data in its state prior 
to work-processing by the work processing unit. 

Hence data to be work-processed and other data can be 
displayed simultaneously and identified, whereby the range of 
the work-processing to be carried out can be accurately and 
easily determined. Therefore, when analyzing documents 
according to their meanings, rather than merely outputting the 
result of the analysis, the entire information analysis 
operation can be supported. 

Further, the document memory unit also stores all or part 
of the document data work-processed by the work processing unit.

Since other data can be handled simultaneously, when 
thereafter analyzing documents according to their meanings, 
rather than merely outputting the result of the analysis, the 
entire information analysis operation can be supported. 

Further, the selection unit further selects all or part 
of the document data output by the output unit. 

Since all or part of the document data output by the output
unit can be selected for analysis, a wide variety of information
can be analyzed with high precision. Therefore, when analyzing
documents according to their meanings, rather than merely
outputting the result of the analysis, the entire information
analysis operation can be supported.

Further, the document memory unit further stores data 
relating to contents of the work processing. 

Hence not only can loss of data relating to the contents
of work-processing be prevented and the data managed easily,
but also the relationship between settings used in the
work-processing and the processed result can be determined.
Therefore, when analyzing documents according to their meanings,
rather than merely outputting the result of the analysis, the
entire information analysis operation can be supported.

A document classification device for classifying 
documents based on contents thereof according to another aspect 
of the present invention comprises an input unit for inputting 
document data; a language analyzer unit for analyzing document 
data input by the input unit and obtaining language analysis 
information; a vector creation unit for creating document characteristic
vectors for the document data based on the language analysis 
information obtained by the language analyzer unit; a 
classification unit for classifying documents based on the 
degree of similarity between document characteristic vectors 
created by the vector creation unit, and creating clusters of 
documents; a cluster characteristics calculation unit for
calculating cluster characteristics, which are
characteristics of clusters of documents created by the
classification unit; and a classification category memory unit
for storing cluster characteristics, calculated by the cluster
characteristics calculation unit, as constituent elements of
classification categories.

According to the above aspect of this invention, it is 
possible to obtain clusters, and to structure and categorize 
the clusters based on their contents using their degree of 
similarity to the cluster center, and the like. 
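
The flow through the units described above (vector creation, similarity-based classification, cluster characteristics, category storage) can be sketched as follows. This is an illustrative sketch under stated assumptions: whitespace tokenization stands in for language analysis, cosine similarity is assumed as the degree of similarity, and a greedy threshold clustering is assumed because the text does not fix a particular clustering method. All function names and the sample documents are hypothetical.

```python
import math
from collections import Counter

def make_vector(text):
    """Stand-in for language analysis: lowercase tokens -> term counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Degree of similarity between two term-weight vectors (assumed: cosine)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def centroid(vectors):
    """Cluster center: the average of the member vectors."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return Counter({t: c / len(vectors) for t, c in total.items()})

def classify(vectors, threshold=0.3):
    """Greedy clustering: join the most similar cluster, else start a new one."""
    clusters = []  # each cluster is a list of document indices
    for i, v in enumerate(vectors):
        best, best_sim = None, threshold
        for members in clusters:
            sim = cosine(v, centroid([vectors[j] for j in members]))
            if sim >= best_sim:
                best, best_sim = members, sim
        if best is None:
            clusters.append([i])
        else:
            best.append(i)
    return clusters

def cluster_characteristics(vectors, members, top_n=3):
    """Characteristic terms: the highest-weighted terms of the cluster center."""
    c = centroid([vectors[j] for j in members])
    ranked = sorted(c.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:top_n]]

docs = [
    "printer paper jam error",
    "paper jam in printer tray",
    "network login password reset",
    "reset network password",
]
vectors = [make_vector(d) for d in docs]
clusters = classify(vectors)

# Classification category memory: characteristics stored per cluster.
category_memory = [
    {"members": m, "characteristics": cluster_characteristics(vectors, m)}
    for m in clusters
]
print(category_memory)
```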

A document classification device for classifying 
documents based on contents thereof according to still another 
aspect of the present invention comprises an input unit for 
inputting document data; a language analyzer unit for analyzing 
document data input by the input unit and obtaining language 
analysis information; a vector creation unit for creating 
document characteristic vectors for the document data based on 
the language analysis information obtained by the language 
analyzer unit; a classification unit for classifying documents 
based on the degree of similarity between document 
characteristic vectors created by the vector creation unit, and 
creating clusters of documents; a cluster characteristics 
calculation unit for calculating cluster characteristics, 
which are characteristics of clusters of documents created by 
the classification unit; a display unit for displaying the 
cluster characteristics calculated by the cluster 
characteristics calculation unit; a cluster selection
specification unit for selecting predetermined clusters from
clusters of documents created by the classification unit; and
a classification category memory unit for storing cluster
characteristics, calculated by the cluster characteristics
calculation unit, as constituent elements of classification
categories.

According to the above aspect of this invention, only 
selected clusters are used, making it possible to structure and 
categorize the clusters in a manner closer to that desired by
the operator. 

Further, the arrangement of the present invention 
described above further comprises a document characteristic 
vector memory unit for storing document characteristic vectors 
created by the vector creation unit; and a vector correction unit
for correcting document characteristic vectors stored in the 
document characteristic vector memory unit, so that document 
characteristic vectors of documents belonging to clusters 
selected by the cluster selection unit are deleted. 
Furthermore, the classification unit classifies documents 
based on the document characteristic vectors corrected by the 
vector correction unit. 

Hence the effects of clusters which are already known can 
be eliminated, and new clusters can be created. 
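
The vector correction step can be sketched as follows, assuming the stored vectors are held in a dictionary keyed by document identifier. The names `vector_memory`, `clusters`, and `correct_vectors`, and the sample data, are hypothetical and serve only to illustrate the idea of deleting vectors that belong to already-selected clusters before classifying again.

```python
def correct_vectors(vector_memory, clusters, selected_clusters):
    """Return the vector memory without documents from the selected clusters."""
    known_docs = {doc_id
                  for name in selected_clusters
                  for doc_id in clusters[name]}
    return {doc_id: vec for doc_id, vec in vector_memory.items()
            if doc_id not in known_docs}

# Hypothetical stored vectors (term -> weight) and a previous classification.
vector_memory = {
    "d1": {"printer": 2, "jam": 1},
    "d2": {"printer": 1, "paper": 1},
    "d3": {"price": 2},
    "d4": {"delivery": 1, "price": 1},
}
clusters = {"c0": ["d1", "d2"], "c1": ["d3", "d4"]}

# The operator selects cluster "c0" as already understood; only the
# remaining vectors would be passed to the classification unit again.
remaining = correct_vectors(vector_memory, clusters, ["c0"])
print(sorted(remaining))
```

Returning a new dictionary rather than deleting in place is a design choice of this sketch; it keeps the original vectors available if the operator later changes the selection.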

Further, the document classification device of the
present invention further comprises a document characteristic
vector memory unit for storing document characteristic vectors
created by the vector creation unit; and a document expression space
correction unit for correcting document expression space when
determining the degree of similarity between document
characteristic vectors stored in the document characteristic
vector memory unit, based on a characteristics amount
calculated from clusters selected by the cluster selection unit.
Furthermore, the classification unit classifies documents
based on the degree of similarity between document
characteristic vectors created by the vector creation unit,
using the document expression space corrected by the document
expression space correction unit.

Hence, cluster characteristics selected by the operator 
in the previous classification can be eliminated from the next 
classification, enabling new clusters to be created. 
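
One possible reading of the document expression space correction is sketched below: the direction of a selected cluster's center is projected out of every document vector before similarity is computed, so that the already-known characteristic no longer contributes. This is an assumption made for illustration; the text does not specify the correction formula. The three-term space, the sample vectors, and the function names are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def project_out(vec, direction):
    """Remove from vec its component along direction (orthogonal projection)."""
    norm_sq = dot(direction, direction)
    if norm_sq == 0:
        return list(vec)
    scale = dot(vec, direction) / norm_sq
    return [x - scale * d for x, d in zip(vec, direction)]

def cosine(a, b):
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    return dot(a, b) / (na * nb) if na and nb else 0.0

# Hypothetical 3-term space: [printer, price, delivery]
doc_a = [2.0, 1.0, 0.0]
doc_b = [2.0, 0.0, 1.0]
known_center = [1.0, 0.0, 0.0]  # characteristic of a selected "printer" cluster

before = cosine(doc_a, doc_b)
after = cosine(project_out(doc_a, known_center),
               project_out(doc_b, known_center))
print(before, after)
```

In this example the two documents look similar only because both mention the known topic; once that direction is removed, their remaining content no longer overlaps, which lets a subsequent classification separate them.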

Further, the document classification device according to
the present invention further comprises a document
characteristic vector memory unit for storing document
characteristic vectors created by the vector creation unit; and a
document expression space correction unit for correcting
document expression space when determining the degree of
similarity between document characteristic vectors stored in
the document characteristic vector memory unit, based on a
characteristics amount calculated from clusters selected by the
cluster selection unit. Furthermore, the classification unit
classifies documents based on the degree of similarity between
document characteristic vectors created by the vector creation
unit, using the document expression space corrected by the
document expression space correction unit.

Hence influences of the known cluster can be eliminated
and cluster characteristics selected by the operator in the
previous classification can be eliminated from the next
classification, enabling new clusters to be created.

Further, the document classification device of the
present invention further comprises a selection information
appending unit for appending selection information showing the
fact of selection when all or part of the documents belonging
to a cluster of documents created by the classification unit
have been selected. Furthermore, the display unit displays the
cluster characteristics, and also displays the selection
information appended by the selection information appending
unit.

Hence it is possible to improve the ability to identify 
documents used on multiple occasions, and the ability to 
identify documents which have not been selected at all. 
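
The selection information can be pictured as a per-document selection count, as in the sketch below. Representing it as a counter is an assumption of this sketch; the document identifiers and function names are hypothetical.

```python
from collections import Counter

def append_selection(selection_info, selected_doc_ids):
    """Record that each of the given documents has been selected once more."""
    selection_info.update(selected_doc_ids)
    return selection_info

def never_selected(all_doc_ids, selection_info):
    """Documents that have not been selected at all."""
    return [d for d in all_doc_ids if selection_info[d] == 0]

def selected_repeatedly(selection_info):
    """Documents that have been selected on multiple occasions."""
    return sorted(d for d, n in selection_info.items() if n > 1)

selection_info = Counter()
append_selection(selection_info, ["d1", "d2"])  # first selected cluster
append_selection(selection_info, ["d2", "d3"])  # second selected cluster

unused = never_selected(["d1", "d2", "d3", "d4"], selection_info)
reused = selected_repeatedly(selection_info)
print(unused, reused)
```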

Further, the classification category memory unit stores
cluster characteristics and/or information created by an
operator, in addition to all or part of the documents belonging
to a cluster of documents selected by the selection
specification unit, as constituent elements of classification
categories.

Hence the contents of clusters can be easily recognized,
and in addition, the operator can easily create his own
classification categories, thereby improving the usefulness of
the classification categories.

A document classification device for classifying 
document clusters in accordance with contents thereof according 
to still another aspect of the present invention comprises a 
document input unit for inputting document data groups; a 
document dividing unit for dividing document data into one or 
multiple divided document data based on a predetermined 
reference; a document-divided document map creation unit for 
creating a map showing the correspondence between the document 
data and the divided document data; a divided document 
classification unit for classifying the divided document data; 
a divided document classification result creation unit for 
creating divided document classification result information 
based on a classification result of the divided document 
classification unit; and a document classification result 
creation unit for creating classification result information 
of the above document data using the document-divided document 
map and the divided document classification result information. 

According to the above aspect of this invention, when one 
document contains multiple topics and meanings, these can be 
classified into categories according to specific topics and 
meanings, so that the classifications do not differ from
categories desired by a user, thereby enabling the user to
easily comprehend the classification categories. Furthermore,
since the positions of the divided documents in documents prior
to division (documents belonging to the clusters) are displayed,
the user is able to efficiently read the parts of the document 
clusters he or she wishes to read. 
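
The relationship between the document dividing unit, the document-divided document map, and the two classification results can be sketched as follows. This is a minimal illustration under assumptions: documents are divided at blank lines, and a toy keyword rule stands in for the divided document classification unit. The identifiers (`divide`, `build_map`, `classify_part`, the `doc#position` naming) are hypothetical.

```python
from collections import defaultdict

def divide(doc_text):
    """Divide a document at blank lines (assumed dividing reference)."""
    return [p.strip() for p in doc_text.split("\n\n") if p.strip()]

def build_map(documents):
    """Return divided documents and a map: divided id -> (document id, position)."""
    divided, doc_map = {}, {}
    for doc_id, text in documents.items():
        for pos, part in enumerate(divide(text)):
            part_id = f"{doc_id}#{pos}"
            divided[part_id] = part
            doc_map[part_id] = (doc_id, pos)
    return divided, doc_map

def classify_part(text):
    """Toy keyword rule standing in for the divided document classifier."""
    return "price" if "price" in text.lower() else "other"

def document_results(divided, doc_map):
    """Classify divided documents, then map the results back to documents."""
    per_part = {pid: classify_part(t) for pid, t in divided.items()}
    per_doc = defaultdict(list)
    for pid, category in per_part.items():
        doc_id, pos = doc_map[pid]
        per_doc[doc_id].append((pos, category))
    return per_part, {d: sorted(v) for d, v in per_doc.items()}

documents = {
    "doc1": "The price is too high.\n\nDelivery was fast.",
    "doc2": "Please lower the price.",
}
divided, doc_map = build_map(documents)
per_part, per_doc = document_results(divided, doc_map)
print(per_doc)
```

Because the map records the position of each divided document within its source, a result for `doc1` can show that its first part falls under one category and its second part under another.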

Further, the document classification device further 
comprises a document save unit for saving the document data; 
a divided document save unit for saving the divided document 
data; and a document-divided document map save unit for saving 
a document-divided document map created by the document-divided 
document map creation unit. 

Hence for a single document data, it is possible to 
efficiently determine classification results having different 
parameters such as the number of classifications, the 
classification method, and the settings used in the 
classifications, without recreating the divided document data 
and the document-divided document map. Furthermore, by 
classifying the document data and saving the data needed to 
create the classification result, the user is free to take more 
time over the classification, and to re-analyze previously 
classified documents within a given period of time. 

Further, the document classification device in the 
specific arrangement described above further comprises a 
divided document classification result save unit for saving
divided document classification result information created by
the divided document classification result creation unit.

Hence, as an additional effect, after one classification
has been carried out, the result of that classification can be
expressed in a variety of formats such as text, charts, graphs,
and the like.
Furthermore, by saving the divided document classification 
result information, the user is free to take more time over 
classifications and analysis of classification results, and to 
re-analyze previously classified documents in a variety of 
formats within a given period of time. 

Further, the multiple divided document data created by 
the document dividing unit contains the document data in its 
state prior to being divided. 

Hence in addition to a classification structure of 
detailed document data, obtained by classifying the divided 
document data, the user can obtain a classification structure 
fusing schematic macro classifications as a result of 
classifying the document data itself prior to division. 

Further, the document dividing unit divides document data 
based on information relating to the structure of the document 
data.

Hence division and the like of different topics can be 
carried out, whereby documents can be classified in such a 
manner that the detailed classification structures of their
document data can be known.

Further, the document classification device further 
comprises a document element extraction unit for extracting
elements in the document data; and an element-accompanying
information extraction unit for extracting element-accompanying
information accompanying the elements extracted
by the document element extraction unit. Furthermore, the
document dividing unit divides the document data using elements 
extracted by the document element extraction unit, or the 
elements and element-accompanying information extracted by the 
element-accompanying information extraction unit. 

Hence documents can be classified so that the detailed 
classification structure of the document data can be known. 

Further, the document dividing unit divides document data 
in accordance with a specified range.

Hence documents can be classified in accordance with the 
wishes of the user, and so that the detailed classification 
structure of the document data can be known. 

Further, the document dividing unit divides document data 
based on the number of letters, the number of sentences, or both 
the number of letters and the number of sentences. 
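
Division by number of sentences, number of letters, or both can be sketched as follows. The sentence splitter is deliberately naive and the parameter names `max_sentences` and `max_letters` are hypothetical; the text does not state how the limits are chosen.

```python
import re

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def divide_by_size(text, max_sentences=None, max_letters=None):
    """Group sentences into chunks that respect either or both limits."""
    chunks, current = [], []
    for sentence in split_sentences(text):
        candidate = current + [sentence]
        too_many = max_sentences is not None and len(candidate) > max_sentences
        too_long = max_letters is not None and len(" ".join(candidate)) > max_letters
        if current and (too_many or too_long):
            # Close the current chunk and start a new one with this sentence.
            chunks.append(" ".join(current))
            current = [sentence]
        else:
            current = candidate
    if current:
        chunks.append(" ".join(current))
    return chunks

text = "One. Two. Three. Four. Five."
by_sentences = divide_by_size(text, max_sentences=2)
by_letters = divide_by_size(text, max_letters=10)
print(by_sentences)
print(by_letters)
```

A single sentence longer than `max_letters` is kept whole in this sketch rather than cut mid-sentence; passing both limits applies whichever is reached first.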

Hence there is an increased capability to classify
different documents having contents of different topics and the
like. Therefore, as above, documents can be classified so that
the detailed classification structure of the document data can
be known.

Further, the document classification result creation 
unit extracts and presents information showing document data, 
and representative information accompanying the document data, 
as classification result information. 

Hence the user is able to determine a detailed schematic 
structure or overall structure of the document data. 

Further, the document classification result creation 
unit extracts and presents information showing divided document 
data, and representative information accompanying the divided 
document data, as classification result information. 

Hence the user is able to determine a detailed schematic 
structure or overall structure of the document data. In 
addition, the user can easily determine which divided document 
has been classified in a given category. 

A document processing method according to still another 
aspect of the present invention outputs multiple input document 
data in order to display or print the document data in a 
predetermined format, and comprises the steps of storing input 
document data; selecting all or part of the document data stored 
in the document memory unit; extracting data relating to
characteristics of letter rows from all or part of the document
data selected by the selection unit; work-processing all or part
of the document data based on the data relating to
characteristics of letter rows extracted by the characteristics
extraction unit; and outputting all or part of the document data
work-processed by the work processing unit.

According to the above aspect of this invention, when 
analyzing documents according to their meanings, rather than 
merely outputting the result of the analysis, the entire 
information analysis operation can be supported. 

Further, the step of outputting comprises the steps of 
setting a plurality of item values based on the contents of all 
or part of the document data work-processed by the work- 
processing unit; and totalizing all or part of the document data 
for each item value set by the item value set unit; and outputs 
all or part of the document data in the format of a table having 
an item value as at least one axis. 

Hence the result of the work-processing can easily be 
expressed in a cross table, and the contents of the information 
can easily be grasped. Therefore, when analyzing documents 
according to their meanings, rather than merely outputting the 
result of the analysis, the entire information analysis 
operation can be supported. 

Further, the step of outputting further comprises 
outputting all or part of the document data work-processed by 
the work processing unit together with all or part of the 
document data in its state prior to work-processing by the work 
processing unit. 

Hence the data to be work-processed and other data can
be displayed simultaneously and identified, whereby the range
of the work-processing to be carried out can be accurately and
easily determined. Therefore, when analyzing documents
according to their meanings, rather than merely outputting the
result of the analysis, the entire information analysis
operation can be supported.

Further, the step of storing further comprises storing
all or part of the document data work-processed by the work
processing unit.

Since other data can be handled simultaneously, when
thereafter analyzing documents according to their meanings,
rather than merely outputting the result of the analysis, the
entire information analysis operation can be supported.

Further, the step of selecting further comprises
selecting all or part of the document data output by the output
unit.

Since all or part of the document data output by the output
unit can be selected for analysis, a wide variety of information
can be analyzed with high precision. Therefore, when analyzing
documents according to their meanings, rather than merely
outputting the result of the analysis, the entire information
analysis operation can be supported.

Further, the step of storing a document further comprises
storing data relating to contents of the work processing.

Hence not only can loss of data relating to the contents
of work-processing be prevented and the data managed easily,
but also the relationship between settings used in the
work-processing and the processed result can be determined.
Therefore, when analyzing documents according to their meanings,
rather than merely outputting the result of the analysis, the
entire information analysis operation can be supported.

A document classification method for classifying
documents based on contents thereof according to still another
aspect of the present invention comprises the steps of inputting
document data; language-analyzing document data input in the
step of inputting and obtaining language analysis information;
creating document characteristic vectors for the document data
based on the language analysis information obtained in the step
of language-analyzing; classifying documents based on the
degree of similarity between document characteristic vectors
created in the step of creating vectors, and creating clusters
of documents; calculating cluster characteristics, being
characteristics of clusters of documents created in the step
of classifying; and storing cluster characteristics,
calculated in the step of calculating cluster characteristics,
as constituent elements of classification categories.

According to the above aspect of this invention, it is
possible to obtain clusters, and to structure and categorize
the clusters based on their contents using their degree of
similarity to the cluster center, and the like.
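
The steps of this aspect can be sketched roughly as follows. This is a minimal illustration, not the method of the embodiments: whitespace tokenization stands in for language analysis, word counts stand in for document characteristic vectors, and a greedy single-pass grouping by cosine similarity stands in for the classifying step. All function names, the threshold, and the sample documents are assumptions made for the example.

```python
import math
from collections import Counter

def analyze(text):
    """Stand-in for language analysis: lowercase whitespace tokenization."""
    return text.lower().split()

def to_vector(words):
    """Document characteristic vector, here simply word -> count."""
    return Counter(words)

def cosine(a, b):
    """Degree of similarity between two count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def classify(vectors, threshold=0.4):
    """Greedy grouping: join the first cluster whose first member is
    similar enough, otherwise start a new cluster."""
    clusters = []  # each cluster is a list of document indices
    for i, v in enumerate(vectors):
        for c in clusters:
            if cosine(vectors[c[0]], v) >= threshold:
                c.append(i)
                break
        else:
            clusters.append([i])
    return clusters

def cluster_characteristics(cluster, vectors, top=3):
    """Most frequent words over all documents of one cluster."""
    total = Counter()
    for i in cluster:
        total.update(vectors[i])
    return [w for w, _ in total.most_common(top)]

docs = ["engine noise at start", "engine noise when idle",
        "door lock broken", "door lock stuck"]
vectors = [to_vector(analyze(d)) for d in docs]
clusters = classify(vectors)
# Stored as constituent elements of classification categories.
categories = [cluster_characteristics(c, vectors) for c in clusters]
```

Here the cluster characteristics are just the most frequent words; any other summary of a cluster could be stored in the same way.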

A document classification method for classifying
documents based on contents thereof according to still another 
aspect of the present invention comprises the steps of inputting 
document data; language-analyzing document data input in the 
step of inputting and obtaining language analysis information; 
creating document characteristic vectors for the document data 
based on the language analysis information obtained in the step 
of language-analyzing; classifying documents based on the 
degree of similarity between document characteristic vectors 
created in the step of creating vectors, and creating clusters
of documents; calculating cluster characteristics, which are 
characteristics of clusters of documents created in the step 
of classifying; displaying the cluster characteristics 
calculated in the step of calculating cluster characteristics; 
selecting predetermined clusters from the clusters of documents
created in the step of classifying; and storing cluster 
characteristics, calculated in the step of calculating cluster 
characteristics, as constituent elements of classification 
categories.

According to the above aspect of this invention, only 
selected clusters are used, making it possible to structure and 
categorize the clusters in a manner closer to that desired by
the operator. 

Further, the document classification method further
comprises a step of correcting document characteristic vectors
stored in the step of storing document characteristic vectors,
so that document characteristic vectors of documents belonging
to clusters selected by the step of selecting clusters are
deleted. Furthermore, the step of classifying comprises
classifying documents based on the document characteristic
vectors corrected by the step of correcting vectors.

Hence the effects of clusters which are already known can
be eliminated, and new clusters can be created.
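
As a rough sketch of this correcting step, the stored vectors of documents that belong to the selected clusters can be dropped before the next classification pass. The dictionary layout and the names `drop_selected`, `vectors`, and `clusters` are assumptions for illustration only.

```python
def drop_selected(vectors, clusters, selected_ids):
    """Return the stored vectors minus those of documents that belong
    to any selected (already known) cluster."""
    known_docs = {doc for cid in selected_ids for doc in clusters[cid]}
    return {doc: vec for doc, vec in vectors.items() if doc not in known_docs}

vectors = {"d1": [1, 0], "d2": [1, 1], "d3": [0, 1]}
clusters = {"c1": ["d1", "d2"], "c2": ["d3"]}

# The operator selected cluster "c1"; only "d3" is classified again.
remaining = drop_selected(vectors, clusters, {"c1"})
```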

Further, the document classification method further
comprises a step of correcting document expression space when
determining the degree of similarity between document
characteristic vectors stored in the step of storing document
characteristic vectors, based on a characteristics amount
calculated from clusters selected in the step of selecting
clusters, and the step of classifying comprises classifying
documents based on the degree of similarity between document
characteristic vectors created in the step of creating vectors,
using the document expression space corrected in the step of
correcting the document expression space.

Hence cluster characteristics selected by the operator
in the previous classification can be eliminated from the next
classification, enabling new clusters to be created.

Further, the document classification method further 
comprises the steps of correcting document expression space 
when determining the degree of similarity between document
characteristic vectors stored in the step of storing document
characteristic vectors, based on a characteristics amount 
calculated from clusters selected in the step of selecting 
clusters. Furthermore, the step of classifying comprises 
classifying documents based on the degree of similarity between 
document characteristic vectors created in the step of creating 
vectors, using the document expression space corrected in the 
step of correcting the document expression space. 

Hence influences of the known cluster can be eliminated 
and cluster characteristics selected by the operator in the 
previous classification can be eliminated from the next 
classification, enabling new clusters to be created. 
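
One possible reading of "correcting the document expression space" is sketched below. It is an assumption, not the correction defined by the embodiments: the characteristics amount of the selected clusters is taken to be their centroid, and each document vector has its component along that centroid removed before similarity is computed. The names `centroid` and `project_out` are hypothetical.

```python
def centroid(vectors):
    """Mean vector of the documents in the selected cluster(s)."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def project_out(vector, direction):
    """Remove from `vector` its component along `direction`."""
    norm_sq = sum(d * d for d in direction)
    if norm_sq == 0:
        return list(vector)
    scale = sum(v * d for v, d in zip(vector, direction)) / norm_sq
    return [v - scale * d for v, d in zip(vector, direction)]

selected = [[1.0, 0.0, 0.0], [3.0, 0.0, 0.0]]   # documents of a known cluster
known = centroid(selected)                        # [2.0, 0.0, 0.0]
corrected = project_out([1.0, 1.0, 0.0], known)   # [0.0, 1.0, 0.0]
```

After the correction, the known direction no longer contributes to the similarity between documents, so the next classification is driven by the remaining characteristics.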

Further, the document classification method further 
comprises the steps of appending selection information showing 
the fact of selection when all or part of the documents belonging 
to a cluster of documents created in the step of classifying 
have been selected. Furthermore, the step of displaying 
comprises displaying the cluster characteristics, and 
displaying the selection information appended in the step of 
appending selection information. 

Hence it is possible to improve the ability to identify 
documents used on multiple occasions, and the ability to 
identify documents which have not been selected at all. 

Further, the step of creating classification categories
comprises creating cluster characteristics and/or information
created by an operator, in addition to all or part of the
documents belonging to a cluster of documents selected in the
step of specifying selection, as constituent elements of
classification categories.

Hence the contents of clusters can be easily recognized, 
and in addition, the operator can easily create his own 
classification categories, thereby improving the usefulness of 
the classification categories. 

A document classification method for classifying 
document clusters in accordance with contents thereof according 
to still another aspect of the present invention comprises the 
steps of inputting document data groups; dividing document data 
into one or multiple divided document data based on a 
predetermined reference; creating a map showing the 
correspondence between the document data and the divided 
document data; classifying the divided document data; creating 
divided document classification result information based on the 
classification result of classifying the divided documents; and 
creating classification result information of the document data 
using the document-divided document map and the divided 
document classification result information. 

According to the above aspect of this invention, when one
document contains multiple topics and meanings, these can be
classified into categories according to specific topics and
meanings, so that the classifications do not differ from
categories desired by a user, thereby enabling the user to
easily comprehend the classification categories. Furthermore,
since the positions of the divided documents in documents prior
to division (documents belonging to the clusters) are displayed,
the user is able to efficiently read the parts of the document
clusters he or she wishes to read.
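
The division, mapping, and merging steps can be illustrated with a short sketch. It assumes that the predetermined reference is a blank line between paragraphs and uses a trivial keyword rule as a stand-in for the classifier; the names `divide` and `merge_results` and the sample data are hypothetical.

```python
from collections import defaultdict

def divide(documents):
    """Split each document into parts and record, for every part,
    which document it came from and its position in that document."""
    parts, doc_map = [], []
    for doc_id, text in documents.items():
        pieces = (p.strip() for p in text.split("\n\n") if p.strip())
        for pos, part in enumerate(pieces):
            parts.append(part)
            doc_map.append((doc_id, pos))
    return parts, doc_map

def merge_results(doc_map, part_labels):
    """Combine the per-part classification with the map so that each
    category lists (document, position) pairs."""
    result = defaultdict(list)
    for origin, label in zip(doc_map, part_labels):
        result[label].append(origin)
    return dict(result)

documents = {"D1": "engine stalls\n\ndoor rattles", "D2": "engine overheats"}
parts, doc_map = divide(documents)
labels = ["engine" if "engine" in p else "door" for p in parts]  # stand-in classifier
result = merge_results(doc_map, labels)
```

Because the map keeps the position of each part, a document such as "D1" can appear under two categories, each entry pointing at the relevant part of the original document.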

A computer-readable recording medium of still another
aspect of the present invention stores programs for executing
the above-described document classification method on a
computer, thereby making the program readable mechanically, and
enabling the operation of the document classification method
to be executed by a computer.

Other objects and features of this invention will become
understood from the following description with reference to the
accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS 

Fig. 1 is a diagram explaining the entire hardware 
constitution of a data processing system comprising a document 
processor according to a first embodiment of the present
invention; 

Fig. 2 is a diagram explaining the hardware constitution 
of a server in a data processing system comprising the document 
processor according to the first embodiment of the present 
invention;

Fig. 3 is a diagram explaining the hardware constitution
of a client in a data processing system comprising the document 
processor according to the first embodiment of the present 
invention; 

Fig. 4 is a block diagram functionally showing a
constitution of the document processor according to the first
embodiment of the present invention; 

Fig. 5 is a diagram explaining the relationship between 
item names of the document processor according to the first 
embodiment of the present invention;

Fig. 6 is a diagram explaining a data structure of a 
document stored in a document memory of the document processor 
according to the first embodiment of the present invention; 

Fig. 7 is a diagram explaining another data structure of 
a document stored in a document memory of the document processor
according to the first embodiment of the present invention; 

Fig. 8 is a diagram explaining an example of a screen 
display in an output section of a document processor according 
to an embodiment of the present invention; 

Fig. 9 is a diagram explaining another example of a screen
display of an output section of a document processor according
to an embodiment of the present invention; 

Fig. 10 is a diagram explaining another example of a 
screen display of an output section of a document processor 
according to an embodiment of the present invention;

Fig. 11 is a diagram explaining a list of contents of
extraction processing performed by a characteristics extractor 
of a document processor according to the first embodiment of 
the present invention; 

Fig. 12 is a diagram explaining a list of contents of work
processing performed by a work processor of a document processor
according to the first embodiment of the present invention; 

Fig. 13 is a diagram explaining characteristic vectors 
of each item of a document processor according to the first 
embodiment of the present invention;

Fig. 14 is a diagram explaining words, and the number of 
appearances of each word ID, of a document processor according 
to the first embodiment of the present invention; 

Fig. 15 is a diagram explaining another screen display 
of an output section of a document processor according to the
first embodiment of the present invention; 

Fig. 16 is a diagram explaining a command screen for 
creating a cross table in an output section of a document 
processor according to the first embodiment of the present 
invention;

Fig. 17 is a diagram explaining a cross table displaying 
a result of classification processing by an output section of 
a document processor according to the first embodiment of the 
present invention; 

Fig. 18 is a diagram explaining another cross table
displaying a result of classification processing by an output
section of a document processor according to the first 
embodiment of the present invention; 

Fig. 19 is a block diagram showing a detailed constitution
of an output section of a document processor according to the
first embodiment of the present invention; 

Fig. 20 is a flowchart showing an output sequence of a 
cross table of a document processor according to the first 
embodiment of the present invention; 

Fig. 21 is a diagram explaining another screen display
of an output section of a document processor according to the
first embodiment of the present invention; 

Fig. 22 is a diagram explaining another screen display 
of an output section of a document processor according to the 
first embodiment of the present invention;

Fig. 23 is a diagram explaining another screen display 
of an output section of a document processor according to the 
first embodiment of the present invention; 

Fig. 24 is a diagram explaining another screen display 
of an output section of a document processor according to the
first embodiment of the present invention; 

Fig. 25 is a block diagram showing a detailed constitution 
of document memory of a document processor according to the 
first embodiment of the present invention; 

Fig. 26 is a diagram explaining another screen display
of an output section of a document processor according to the
first embodiment of the present invention; 

Fig. 27 is a diagram explaining another screen display 
of an output section of a document processor according to the 
first embodiment of the present invention;

Fig. 28 is a diagram explaining another screen display 
of an output section of a document processor according to the 
first embodiment of the present invention; 

Fig. 29 is a flowchart showing a sequence of document
processing in a document processor according to the first
embodiment of the present invention; 

Fig. 30 is a block diagram functionally showing a 
constitution of a document classification device according to 
a second embodiment of the present invention; 

Fig. 31 is a diagram explaining an example of a display
of a cluster characteristics display section in a document
classification device according to the second embodiment of the 
present invention; 

Fig. 32 is a flowchart showing a sequence of processing 
in a document classification device according to the second
embodiment of the present invention; 

Fig. 33 is a block diagram functionally showing a 
constitution of a document classification device according to 
a third embodiment of the present invention; 

Fig. 34 is a flowchart showing a sequence of processing
in a document classification device according to the third
embodiment of the present invention;

Fig. 35 is a block diagram functionally showing a
constitution of a document classification device according to
a fourth embodiment of the present invention;

Fig. 36 is a flowchart showing a sequence of processing
in a document classification device according to the fourth
embodiment of the present invention;

Fig. 37 is a block diagram functionally showing a
constitution of a document classification device according to
a fifth embodiment of the present invention;

Fig. 38 is a flowchart showing a sequence of processing
in a document classification device according to the fifth
embodiment of the present invention;

Fig. 39 is a block diagram functionally showing a
constitution of a document classification device according to
a sixth embodiment of the present invention;

Fig. 40 is a diagram explaining a table provided in a
classification result memory of a document classification
device according to the sixth embodiment of the present
invention;

Fig. 41 is a flowchart showing a processing sequence of
a selection information append section of a document
classification device according to the sixth embodiment of the
present invention;

Fig. 42 is a block diagram showing a constitution of a
document classification device according to a seventh 
embodiment of the present invention; 

Fig. 43 is a diagram explaining a document classification 
device and a document classification method according to the 
seventh embodiment of the present invention; 

Fig. 44 is another diagram explaining a document
classification device and a document classification method 
according to the seventh embodiment of the present invention; 

Fig. 45 is another diagram explaining a document 
classification device and a document classification method 
according to the seventh embodiment of the present invention; 

Fig. 46 is another diagram explaining a document
classification device and a document classification method 
according to the seventh embodiment of the present invention; 

Fig. 47 is a block diagram showing a constitution of a 
document classification device according to an eighth 
embodiment of the present invention; 

Fig. 48 is a block diagram showing a constitution of a 
document classification device according to a ninth embodiment 
of the present invention; 

Fig. 49 is a diagram explaining a document classification 
device and a document classification method according to a tenth 
embodiment of the present invention; 

Fig. 50 is a diagram explaining a document classification 
device and a document classification method according to an
eleventh embodiment of the present invention; 

Fig. 51 is a block diagram showing a constitution of a 
document classification device according to a twelfth 
embodiment of the present invention;

Fig. 52 is a diagram explaining a document classification 
device and a document classification method according to the 
twelfth embodiment of the present invention; 

Fig. 53 is a diagram explaining a document classification 
device and a document classification method according to a
thirteenth embodiment of the present invention; 

Fig. 54 is a diagram explaining a document classification 
device and a document classification method according to a 
fourteenth embodiment of the present invention; 

Fig. 55 is a diagram explaining a document classification
device and a document classification method according to a
fifteenth embodiment of the present invention; and 

Fig. 56 is a diagram explaining a document classification 
device and a document classification method according to a 
sixteenth embodiment of the present invention.

DESCRIPTION OF THE PREFERRED EMBODIMENTS 

Preferred embodiments of a document processor, a document 
processing method, and a computer-readable recording medium for 
recording a program to execute the method on a computer
according to the present invention will be described below with
reference to the accompanying drawings. 

To begin with, the hardware constitution of an entire data 
processing system comprising a document processor according to 
a first embodiment of the present invention will be explained.
Fig. 1 is a diagram explaining the hardware constitution of an 
entire data processing system comprising a document processor 
according to the first embodiment of the present invention. 

As shown in Fig. 1, a data processing system comprising
the document processor according to the first embodiment
comprises a server/client system. That is, a server 101 and
multiple clients 102 are connected via a network 103. The
clients 102 create work data such as classification data, send
this to the server 101, and display the results of work
processing such as classification, and the like. On the other
hand, in compliance with specifications from the clients 102,
the server 101 carries out vast numerical calculations to
perform work processing such as document (text) classification,
and sends the results of the processing to the clients 102.

More specifically, when performing classification
processing, the server 101 classifies a text (pre-processing,
clustering) and the clients 102 create classification data,
program execution commands, tables of text classification
results, and such like. As described above, the processing at
the server 101 is divided into two types, "pre-processing" and
"classification", and the burden of this processing can be
extremely heavy when there is a vast amount of data. Therefore,
a manager process creates a processing receive list and controls
the processing, so that "pre-processing" and "classification"
in the server 101 are only performed once each.

Furthermore, data is exchanged between the server 101 and
the clients 102 by a method termed joint filing. That is, a
file used in processing such as classification is created in
a joint folder on the server 101, enabling both sides to exchange
the data. Therefore, the clients 102 can use the joint folder
of the server 101 via the joint network.

The constitution of the hardware of the server 101 and
the clients 102 will be explained below. Fig. 2 is a diagram
explaining a hardware constitution of the server 101 in the data
processing system comprising the document processor according
to the first embodiment. A work station (WS) is, for example,
used as the server 101.

In Fig. 2, reference symbol 201 represents a CPU for
controlling the entire server 101, reference symbol 202
represents a ROM which stores boot programs and the like,
reference symbol 203 represents a RAM used as a work area of the
CPU 201, reference symbol 204 represents an interface (I/F),
which is connected to the network 103 via a communications line
205 and controls the network 103 and an internal interface, and
reference symbol 206 represents a disk device for storing data.
The reference symbol 200 represents a bus for coupling the above
parts.

In addition, a display 208 for displaying document
information, image information, function information, and the
like, a keyboard 209 for inputting data, and a mouse 210 and
the like, may similarly be connected. Moreover, the disk device
206 comprises a joint folder 207 for exchanging data with the
clients 102.

Furthermore, Fig. 3 is a diagram explaining a hardware
constitution of a client 102 in a data processing system
comprising the document processor according to the first
embodiment. A personal computer (PC) is, for example, used as
the client 102.

In Fig. 3, reference symbol 301 represents a CPU for
controlling the entire system, reference symbol 302 represents
a ROM which stores boot programs and the like, reference symbol
303 represents a RAM used as a work area of the CPU 301, reference
symbol 304 represents an HDD (hard disk drive) for controlling
reading and writing of data to an HD (hard disk) 305 in compliance
with the CPU 301, reference symbol 305 represents an HD for
storing data written in compliance with the HDD 304, reference
symbol 306 represents an FDD (floppy disk drive) for controlling
reading and writing of data to an FD (floppy disk) 307 in
compliance with the CPU 301, reference symbol 307 represents
a freely attachable and detachable FD for storing data written
in compliance with the FDD 306, and reference symbol 308
represents a display for displaying documents, images, function
data, etc.

Furthermore, reference symbol 309 represents an
interface (I/F), which is connected to the network 103 via a
communications line 310 and controls the network 103 and the
internal interface, reference symbol 311 represents a keyboard
comprising keys for inputting letters, numbers, a variety of
commands, and the like, reference symbol 312 represents a mouse
for moving a cursor and selecting a range, or pressing icons
and buttons displayed on a display screen, moving windows and
changing their sizes, and the like, reference symbol 313
represents a scanner, having an OCR (optical character reader)
function, for optically reading images, reference symbol 314
represents a printer for printing contents and the like of data
comprising classification results, and reference symbol 315
represents a bus for joining all the above parts. Furthermore,
application software 316 such as word processing software
is stored in the HD 305.

The functional constitution of the document processor
according to the first embodiment will be explained here. Fig.
4 is a block diagram functionally showing a constitution of the
document processor according to the first embodiment of the
present invention. In Fig. 4, the document processor comprises
an input section 401, a document memory 402, a selector 403,
a characteristics extractor 404, a work processor 405, and an
output section 406.

The input section 401, the document memory 402, the
selector 403, the characteristics extractor 404, the work
processor 405, and the output section 406 are controlled by
the CPU 201, the CPU 301, and the like, which operate processing in
compliance with commands contained in programs recorded in
recording media such as the ROM 202 and 302, the RAM 203 and 303,
or the disk device 206 and the hard disk 305, etc.

The input section 401 inputs document data, and for
example comprises the I/F 204 or 309, or the like, capable of
obtaining documents and groups of documents via the keyboard 209
or 311, the scanner 313 having an OCR function, and the network
103. Furthermore, in addition to the above, the input
section 401 encompasses any part capable of extracting document
data. For example, when the document data is
saved in a database, and the medium in which the database is
stored is provided in the document processor of the first
embodiment, the document data is input from that medium.

A document is a collection of one or more sentences
written in a natural language, comprising letters, rows of
letters, numbers, and the like, which are organized into a
meaningful arrangement to form one document. Furthermore, a
collection of multiple documents is termed a document cluster.

A document comprises one or multiple items. An item
comprises an item name and an item value. An item name is a label
showing the contents of the item, and may or may not be included 
in the document. An item value is the actual content of the 
item. Fig. 5 is a diagram explaining the relationship between 
an item name and an item value in the document processor 
according to the first embodiment. Fig. 5 shows an example in 
which one patent specification forms one document, and the 
patent specification is expressed using an item name and an item 
value.

A unique document ID is appended to each document and each 
document in the document clusters obtained by the input section 
401, and these are stored in the document memory 402. Fig. 6 
is a diagram explaining the structure of document data stored 
in the document memory 402 of the document processor according 
to the first embodiment. Each item name and each item value
is saved in one memory unit, that is, in one cell of the document
memory 402.

In Fig. 6, one cell comprises three memory regions, and 
the position (number) of the next cell in the document memory 
402 is stored in the first memory region 601. The generic value 
of the cell is stored in the second memory region 602. 

The generic values of the cells can, for example, be set
such that "0" signifies "empty", "1" signifies "numerical
value", "2" signifies "letter row", and so on. The content of the
cell, that is, the head position of the region in which the item
name or the item value and the like are stored, is stored
in the third memory region 603.

Rearrangement of the cell sequence, and addition and
deletion of cells, can easily be performed by changing the
position of the next cell stored in the first memory region 601.
Furthermore, since the actual content of the cell is stored in
a different region in the cell structure, when an item has been
updated and can no longer be held in a region reserved in advance,
for example, it is only necessary to reserve another, larger
region in which to store the item, with no effect on the structure
of the cell itself, and to update the head position stored in
the third memory region 603.
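
The three-region cell of Fig. 6 behaves like a linked list whose payload lives in a separate region. The sketch below is only an illustration under that reading: list indices stand in for memory positions, and the class and method names are hypothetical.

```python
EMPTY, NUMBER, STRING = 0, 1, 2  # generic values as given in the text

class CellStore:
    def __init__(self):
        self.cells = []     # each cell: [next position, generic value, content position]
        self.contents = []  # separate region holding the actual item names/values

    def add(self, generic_value, content):
        self.contents.append(content)
        self.cells.append([None, generic_value, len(self.contents) - 1])
        return len(self.cells) - 1

    def link(self, cell, next_cell):
        """Change only the 'next cell' region of one cell."""
        self.cells[cell][0] = next_cell

    def update(self, cell, content):
        """Store new content elsewhere and repoint the content region."""
        self.contents.append(content)
        self.cells[cell][2] = len(self.contents) - 1

    def walk(self, head):
        i = head
        while i is not None:
            nxt, generic_value, pos = self.cells[i]
            yield generic_value, self.contents[pos]
            i = nxt

store = CellStore()
a = store.add(STRING, "Title")
b = store.add(STRING, "Document processor")
c = store.add(NUMBER, 5)
store.link(a, b)
store.link(b, c)
before = list(store.walk(a))

store.link(a, c)                      # delete cell b by relinking one pointer
store.update(a, "Title of invention")  # longer content, cell layout unchanged
after = list(store.walk(a))
```

Deleting or reordering cells touches only the first region of a cell, and growing an item touches only the third, which is the property the paragraph above describes.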

Fig. 7 is a diagram explaining another data structure of
a document stored in the document memory 402 of the document
processor according to the first embodiment. In Fig. 7, one
cell uses two memory regions. The generic value of the cell
is stored in a first memory region 701. The content of the cell,
that is, the head position of the region in which the item name
or the item value and the like are stored, is stored in a
second memory region 702.

The next cell is stored in the next memory region adjacent 
in the document memory 402. With this data structure, a 
movement operation within the memory is required when cells have 
been rearranged, added, or deleted. 
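
For contrast, the two-region layout of Fig. 7 can be pictured as a contiguous array. In this small sketch, which is an assumption for illustration with hypothetical names, inserting a cell forces every later cell to move by one slot.

```python
def insert_cell(cells, index, cell):
    """Insert into a contiguous cell array and report how many
    existing cells had to be moved to make room."""
    moved = len(cells) - index
    cells.insert(index, cell)  # list.insert shifts all later elements
    return moved

# Each cell: (generic value, content); adjacency gives the order.
cells = [(2, "Title"), (2, "Document processor"), (1, 5)]
moved = insert_cell(cells, 1, (2, "Abstract"))
```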

The document memory 402 usually comprises a semiconductor memory
for handling data at high speed, but may include an
auxiliary memory device comprising a magnetic disk, an optical
disk, or the like.

Documents and document clusters stored in the document
memory 402 are displayed by the output section 406. In the first
embodiment, the output section 406 comprises a CRT display, a
liquid crystal display, or the like. The output section 406
reads out the contents of documents and document clusters stored
in the document memory 402 in the cell sequence, and displays
or prints them in table format.

Furthermore, the output section 406 may also comprise a
graph drawer 407 for drawing graphs based on the data displayed
or printed in table format. The graph drawer 407 reads out
contents of a region set by the user with respect to item values
of a document or a cluster of documents stored in the document
memory 402, draws graphs such as bar graphs, pie charts, regular
line graphs, and the like, and displays and prints them.

The output section 406 also displays operations of the 
input section 401, by for example displaying operation menus, 
mouse pointers, cursor displays, and the like. Furthermore,
the output section 406 may also comprise a printing device such 
as a printer for printing the results of processing. 

In compliance with a command input by the operator to the
input section 401, the selector 403 reads out data in a region
selected on the display of the output section 406 from the
document memory 402, and sends it to the characteristics
extractor 404. The method by which the selector 403 makes its
selection will be explained using Figs. 8 to 10.

Figs. 8 to 10 are diagrams explaining examples of screen
displays of the output section 406 of the document processor
according to the first embodiment. More specifically, the
diagrams show examples of screen displays listing types of
vehicle malfunctions. In Fig. 8, the display screen displays
a "numbers" column 801 showing document ID numbers, a "date
received" column 802 showing the date on which the malfunction
information was received, a "sales office" column 803 showing
the sales office where the malfunction information was received,
a "vehicle type" column 804 showing the type of vehicle to which
the information refers, a "year" column 805 showing the year
of the vehicle to which the information refers, and a "contents"
column 806 showing the content of the malfunction information.

In Fig. 9, a selected region 901 is the portion displayed
within the rectangle and altered in color. Similarly, in Fig.
10, the selected region 1001 is the portion displayed within
the rectangle and altered in color.

The region selected by the selector 403 may be one part
of a column displayed on the screen as shown in Fig. 9, or, when
an item name is selected as shown in Fig. 10, all the item values
belonging to that item name may be selected. In the first
embodiment, only regions belonging to letter rows can be
selected.
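
The two selection modes can be sketched as follows. The table layout (one dictionary per document), the function name `select`, and the sample rows are assumptions made for the example; the restriction to letter rows is modeled by keeping only string values.

```python
def select(table, item_name, row_indices=None):
    """Return the item values of one column. With row_indices=None the
    whole column is selected (item name selected); otherwise only the
    given rows (part of a column). Non-string values are skipped."""
    indices = range(len(table)) if row_indices is None else row_indices
    return [table[i][item_name] for i in indices
            if isinstance(table[i][item_name], str)]

table = [
    {"number": 1, "contents": "engine noise at start"},
    {"number": 2, "contents": "door lock broken"},
    {"number": 3, "contents": "engine stalls when idle"},
]

whole_column = select(table, "contents")          # item name selected
part_of_column = select(table, "contents", [0, 2])  # rectangular region
not_selectable = select(table, "number")           # numerical values
```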

Next, the process of extraction performed by the 
characteristics extractor 404 will be explained. An item value 
is selected by the selector 403, and the characteristics of the 
item name are extracted by the characteristics extractor 404.
Fig. 11 is a diagram explaining a list of contents of extraction 
processing performed by the characteristics extractor 404 of 
the document processor according to the first embodiment. 

In Fig. 11, extraction includes extracting a word
contained in a letter row, the number of words, the number of
letters in the word, the number of appearances of that word,
etc. These are extracted using a natural language processing
technique such as morphological analysis or syntax analysis,
generally used in devices such as a rule-based speech synthesis
device or an automated translation device.
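
A minimal sketch of the listed extractions is given below. Whitespace tokenization is used only as a stand-in for the natural language processing technique mentioned in the text, and the function name `extract` and the keys of the returned dictionary are hypothetical.

```python
from collections import Counter

def extract(letter_row):
    """Extract simple characteristics from one letter row."""
    words = letter_row.lower().split()
    return {
        "words": words,                                  # words contained in the row
        "word_count": len(words),                        # number of words
        "letters": {w: len(w) for w in set(words)},      # letters per word
        "appearances": Counter(words),                   # appearances of each word
    }

features = extract("Brake noise brake")
```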

Next, work processing performed by the work processor 405 
will be explained. The work processor 405 processes the amount 
of characteristics extracted by the characteristics extractor 
404. Fig. 12 is a diagram explaining a list of contents of work 
processing performed by the work processor 405 of the document
processor according to the first embodiment. 

Work processing comprises processing such as
"classification" for classifying identical characteristics,
"retrieval" for retrieving a predetermined amount of
characteristics, "rearranging" for rearranging contents of the
characteristics amount, "representative extraction" for
extracting a representative value of an amount of
characteristics, "maximum value extraction" for extracting a
maximum value from an amount of characteristics, "minimum value
extraction" for extracting a minimum value from an amount of
characteristics, "calculation" for calculating an amount of
characteristics, and such like.

The operator can freely combine the characteristics to be
extracted by the characteristics extractor 404 with the work
processing to be applied to them by the work processor 405.
Furthermore, it is possible to preset highly-efficient
combinations, and supply these to the operator.

The result of the processing carried out by the work
processor 405 is saved in a work-processing result saving
section 408 in the work processor 405. The processed result
saved in the work-processing result saving section 408 is output
from the output section 406. The output section 406 reads out
the contents of the work-processing result saving section 408,
and displays or prints them.

Here, an example will be explained in which the number
of appearances of a word contained in the item value is selected
as the (amount of) characteristics extracted by the
characteristics extractor 404, and classification is selected
as the work-processing to be carried out by the work processor
405.

In general, when there are two documents, and the
frequencies of appearance of the words making up the two
documents are equal, it can be assumed that the meanings of the
two documents are similar to each other. That is, the number of
appearances of a word in a document is a characteristic having
a profound relationship to the meaning of the document.
Therefore, it can be envisaged that when multiple documents have
been classified using the number of appearances of a word
therein as a characteristic, documents having similar meanings
will be gathered into the same classification categories.

The analyzer 409 in the characteristics extractor 404
performs natural language analysis, such as format element
analysis, on each of the one or multiple item values selected by
the selector 403, and divides them into words. Furthermore,
information representing the part of speech of each word is
appended thereto. Of the words appearing, a unique word ID is
appended to those that are nouns, and the number of appearances
of each word ID is counted for one item value, and for all item
values selected by the selector 403.

The characteristics extractor 404 comprises a
characteristic vector creator 410, which creates an item value
characteristic vector showing the (amount of) characteristics
of individual item values based on the number of appearances
counted. For example, Fig. 13 shows the characteristic vectors
for each item value when the item values selected by the selector
403 are:

"Large noise pollution"
"Vehicle paint changes color"
"Overheat occurs"
"Paint is peeling"
"Battery is dead"
"Black exhaust fumes"

Furthermore, Fig. 14 shows the words and the number of
appearances of each of the word IDs of those words.
Hence, the following characteristic vectors were
obtained:

"Large noise pollution" : {1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}
"Vehicle paint changes color" : {0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0}
"Overheat occurs" : {0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0}
"Paint is peeling" : {0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0}
"Battery is dead" : {0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0}
"Black exhaust fumes" : {0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1}
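
The counting of word appearances into characteristic vectors
may be sketched as follows. This is only an illustrative sketch
under stated assumptions: the function name build_vectors is
hypothetical, whitespace splitting with a stop-word set stands
in for the part-of-speech analysis that selects nouns, and word
IDs are assigned in order of first appearance, so the result is
not intended to reproduce the vectors of Fig. 13.

```python
from collections import Counter


def build_vectors(item_values, stop_words=frozenset()):
    """Assign a word ID to each distinct word and count appearances.

    Assumption: whitespace splitting plus a stop-word set replaces the
    natural language analysis (noun selection) described in the text.
    """
    vocab = {}       # word -> word ID, in order of first appearance
    tokenized = []
    for text in item_values:
        words = [w for w in text.lower().split() if w not in stop_words]
        tokenized.append(words)
        for w in words:
            vocab.setdefault(w, len(vocab))

    vectors = []
    for words in tokenized:
        counts = Counter(words)
        vec = [0] * len(vocab)
        for w, n in counts.items():
            vec[vocab[w]] = n   # component = number of appearances
        vectors.append(vec)
    return vocab, vectors


vocab, vectors = build_vectors(
    ["Paint is peeling", "Vehicle paint changes color"],
    stop_words={"is"},
)
```

Each vector has one component per word ID, so item values that
share words have non-zero values in the same positions.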

The characteristic vectors of these item values are
output from the characteristics extractor 404 and sent to the
work processor 405. The work processor 405 classifies the
documents using the characteristic vectors of the item values.
Firstly, the distances between the individual vectors are
calculated. For example, the distances can be measured using
their inner products.
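
As an illustration of measuring closeness by inner products, the
following sketch computes the inner product of two of the
characteristic vectors listed above. The cosine normalization is
an added assumption, not something stated in this description;
it is included only because it makes values comparable between
vectors of different lengths. The function names are
hypothetical.

```python
import math


def inner_product(u, v):
    """Sum of component-wise products of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine_similarity(u, v):
    """Inner product normalized by vector lengths (an assumption)."""
    norm = math.sqrt(inner_product(u, u) * inner_product(v, v))
    return inner_product(u, v) / norm if norm else 0.0


# Vectors taken from the list of characteristic vectors above.
paint_color = [0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
paint_peeling = [0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0]
large_noise = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

print(inner_product(paint_color, paint_peeling))   # 2
print(inner_product(large_noise, paint_peeling))   # 1
```

A larger inner product indicates more shared words, so the two
paint-related item values are judged closer to each other than
either is to "Large noise pollution".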

After the distances have been calculated, the vectors with
the nearest distances are gathered together. For example, a
K-means method is used to classify a group of vectors into K
vector groups in correspondence with the distances
thereof. When the vectors have been classified, the work
processor 405 appends to each item value a number showing which
classification its vector belongs to, that is, a cluster number,
together with the document ID corresponding to the item
value, and sends the result to the output section 406, where
they are displayed.
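
A K-means classification of this kind may be sketched as
follows. This is a minimal sketch, not the implementation of the
work processor 405: it assumes squared Euclidean distance rather
than an inner-product measure, takes the first K vectors as the
initial centers so that the result is deterministic, and returns
cluster numbers starting from 0. The function names are
hypothetical.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def k_means(vectors, k, iterations=20):
    """Return one cluster number (0 to k-1) per input vector."""
    # Assumption: the first k vectors serve as the initial centers.
    centers = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector joins its nearest center.
        labels = [
            min(range(k), key=lambda c: squared_distance(v, centers[c]))
            for v in vectors
        ]
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


labels = k_means([[1, 0], [0, 1], [1, 0.1], [0.1, 1]], k=2)
print(labels)  # [0, 1, 0, 1]
```

The returned list can be paired with document IDs to give the
cluster number of each document.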

Fig. 15 shows an example of a screen display of cluster
number 1501. Documents which have the same cluster number (for
example, documents "1" and "6" both have the cluster number
"5") belong to the same classification group.

Next, an arrangement of a second aspect of the present
invention in which a cross table is output will be explained.
After the input section 401 has read out a cluster of documents
to be analyzed, the operator inputs commands indicating the
names of items to be classified, the names of items which will
form the vertical or horizontal axis of the cross table, and
the number of classifications.

Fig. 16 shows a command screen for creating a cross table.
In Fig. 16, the command screen 1600 comprises a process item
name input column 1601, an axis item name column 1602, a vertical
axis command button 1603, a horizontal axis command button 1604,
and a classification number input column 1605.

The name of the item to be processed is input to the
process item name input column 1601. The item name can be input
from the keyboard 209 or the like, or by using the mouse 210
or the like to select an item from available items being
displayed. Furthermore, the name of the item to be used as an
axis is input to the axis item name column 1602. This can be
input by the same method as for the process item name input
column 1601.

The vertical axis command button 1603 and the horizontal
axis command button 1604 are for specifying commands to display
an item name to be an axis on the vertical axis or the horizontal
axis. Furthermore, the number of classifications is input to
the classification number input column 1605. The number of
classifications can be input from the keyboard 209 or the like,
or by using the mouse 210 or the like to select an item from
available items being displayed.

In Fig. 16, "contents" is input to the process item name
input column 1601, "vehicle type" is input to the axis item name
column 1602, the horizontal axis command button 1604 is marked,
and "50" is input to the classification number input column 1605.
This indicates that a command has been given to classify the
document cluster into "50" classifications based on the
"contents" of the document cluster, and to display "vehicle
type" along the horizontal axis of the cross table.

Following a command to create the cross table,
classification is carried out, and the classification result
is displayed in the cross table. Figs. 17 and 18 are diagrams
showing cross tables displaying classification results. In
the cross table 1700 of Fig. 17, the vertical axis displays
"cluster 1", "cluster 2", and so on, showing classifications,
and the horizontal axis displays "ABC1600", "ABC1800", and so
on, showing vehicle types.

The vertical axis of the table, that is, the lines,
corresponds to clusters created by classification. The first
column of each line contains letter rows showing values
determined at the end of classification as preset cluster
numbers. The horizontal axis of the table, that is, the columns,
displays non-duplicating letter rows contained in the item
"vehicle type" of the document cluster. Each cell of the line
"cluster 1" displays the number of the documents classified into
cluster 1 in which the value of the item "vehicle type" matches
the vehicle type in that column.

Here, instead of displaying numbers, the magnitude of each
number may be expressed by the color intensity of the cell, or
by the size of the area painted within the cell. Furthermore,
the far-right column and the bottom line of the table show the
totals of the lines and columns.

In Fig. 18, when a mouse pointer 1800 is moved to a cell of
the cross table 1700 and the mouse button of the mouse 210 is
pressed, or when the cursor is moved by operating a cursor key
on the keyboard 209 and a specific key is pressed, a content
display screen 1801 is displayed near that cell, showing the
item "contents" of the corresponding documents.

The content display screen 1801 displays the number of
data in the cell, the display items, cell information, and
contents of the display items in the data. The cell specified
by the mouse pointer 1800 displays a data number: "4", display
item: "contents", cell information: "ABC2000-cluster 1", and
four contents as "contents" of the display items: "exhaust is
black, exhaust is black, ...". Consequently, the contents of
a cell can be identified simply by moving the mouse pointer to
the desired cell and pressing the mouse button.

Furthermore, the items displayed in the content display 
screen 1801 can be updated by resetting, all the items can be 
displayed, and items can be selectively displayed. 

The first column of each line contains letter rows showing
values determined at the end of classification as preset cluster
numbers. This column can be rewritten by the operator. For
example, after confirming the contents of a cell by the
operation described above, "cluster 1" can be rewritten as
"exhaust problems." As a consequence, it is easier to grasp
the content of the information.

Furthermore, instead of inserting a letter row showing
the value determined at the end of classification as a preset
cluster number, it is possible to extract a letter row showing
the characteristics of the cluster, and insert this into the
cell. For example, this can be achieved by extracting the
phrases and words which appear most frequently from the item
"contents" of the documents contained in cluster 1.

In Fig. 18, words such as "exhaust is black" or "exhaust"
are entered into the cluster 1. Thus, by a simple operation,
the operator is able to learn not only the distribution of the
entire document cluster, but also, where necessary, the detailed
contents of individual documents.
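
Extracting the most frequent words of a cluster as its label may
be sketched as follows. This is an illustrative sketch only: the
function name cluster_label and the parameters stop_words and
top_n are hypothetical, and whitespace splitting stands in for
the language analysis described earlier. Phrase extraction is
not covered.

```python
from collections import Counter


def cluster_label(contents, stop_words=frozenset(), top_n=2):
    """Return the top_n most frequent words in a cluster's contents."""
    counts = Counter(
        word
        for text in contents
        for word in text.lower().split()
        if word not in stop_words
    )
    return [word for word, _ in counts.most_common(top_n)]


label = cluster_label(
    ["exhaust is black", "exhaust smells", "black exhaust fumes"],
    stop_words={"is"},
)
print(label)  # ['exhaust', 'black']
```

The resulting words can be written into the first column of the
line in place of a preset cluster number.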

Next, the constitution of the output section 406 for
creating a cross table will be explained in detail. Fig. 19
is a block diagram showing a constitution of the output section
406 of the document processor according to the first embodiment.
The output section 406 comprises an item value selector 1901,
and a totalizer 1902, in addition to the graph-drawing section
407. Moreover, the totalizer 1902 comprises a table saving
section 1903 having a memory region in correspondence with
contents which are actually displayed.

In compliance with an item name (axial item name)
specified by the operator as one axis of the cross table, the
item value selector 1901 sequentially reads out item values from
document data stored in the document memory 402, and gathers
item values which are not duplicated. Furthermore, the
totalizer 1902 totalizes the documents by adding a numerical
value to the region corresponding to the item value of the table
saving section 1903.

Next, the output sequence of a cross table will be 
explained. Fig. 20 is a flowchart showing an output sequence 
of a cross table of the document processor according to the first 
embodiment. In the flowchart of Fig. 20, the contents of the 
table saving section 1903 are initialized (Step S2001) prior 
to totalization. 

Next, an item value produced by the item value selector
1901 is allocated to a portion of the table corresponding to
the item value label (Step S2002), and a letter row expressing
a cluster number is allocated to a portion corresponding to the
cluster number (Step S2003).

Next, an item value corresponding to the axial item value
is determined by referring to documents stored in the document
memory 402 to find the document ID which corresponds with the
item value saved in the work-processing result saving section 408
(Step S2004). Thereafter, 1 is added to the contents of the
corresponding region in the table saving section 1903 (Step
S2005).

It is then determined whether all the item values have
been processed (Step S2006), and if not (NO in the Step S2006),
the sequence shifts back to the Step S2004, and the processes
between the Steps S2004 to S2006 are repeated.

When it has been determined in the Step S2006 that
processing has been carried out for all the item values (YES in
the Step S2006), the total of each line is calculated to be
displayed in the far-right column (Step S2007), and
simultaneously, the total of each column is calculated to be
displayed in the bottom line (Step S2008).

Thereafter, a table formed in the table saving section
1903 is sequentially read out (Step S2009), whereby all
processing ends.
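
The totalization sequence of Fig. 20 may be sketched as follows.
This is a minimal sketch under stated assumptions: the function
name build_cross_table and the two input dictionaries (document
ID to cluster number, and document ID to axial item value) are
hypothetical stand-ins for the work-processing result saving
section 408 and the document memory 402. The step numbers in the
comments indicate the approximate correspondence only.

```python
from collections import defaultdict


def build_cross_table(cluster_of, axis_value_of):
    """Count documents per (cluster number, axial item value) cell."""
    table = defaultdict(int)                       # S2001: all cells start at 0
    columns = sorted(set(axis_value_of.values()))  # S2002: non-duplicated values
    lines = sorted(set(cluster_of.values()))       # S2003: cluster numbers

    for doc_id, cluster in cluster_of.items():     # S2004 to S2006: every document
        table[(cluster, axis_value_of[doc_id])] += 1   # S2005: add 1 to the cell

    # S2007: total of each line; S2008: total of each column.
    line_totals = {r: sum(table[(r, c)] for c in columns) for r in lines}
    column_totals = {c: sum(table[(r, c)] for r in lines) for c in columns}
    return lines, columns, dict(table), line_totals, column_totals


lines, columns, table, line_totals, column_totals = build_cross_table(
    cluster_of={1: 1, 2: 1, 3: 2},
    axis_value_of={1: "ABC1600", 2: "ABC1800", 3: "ABC1600"},
)
print(line_totals)    # {1: 2, 2: 1}
print(column_totals)  # {'ABC1600': 2, 'ABC1800': 1}
```

Reading out the returned table line by line corresponds to the
final output of the Step S2009.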

Data output from the work processor 405 can be sent to
the document memory 402, and stored there with other data in
the document memory 402. Data which have been output from the
work processor 405 and stored in the document memory 402 can
be displayed by the output section 406 as a new row of the table.
Furthermore, existing rows of the table can be deleted, and
replaced by writing the new data.

In this constitution, the result of the processing, being
the data output from the work processor 405, can be handled on
an equal footing with other data in the document memory 402
which was not processed this time. In subsequent analysis, the
data can be selected for work processing without needing to
distinguish whether it was present in the original input data,
or was created by the work processor 405 during analysis.



Therefore, the data to be work processed and the contents
of the work processing can be flexibly selected in accordance
with the type of data, and the contents of the information
analysis to be performed, enabling a wide variety of information
to be analyzed with high precision.

Furthermore, it is possible to input to the work processor
405 not only data output from the characteristics extractor 404,
but also data selected by the selector 403. Consequently,
additional work processing can be carried out on data whose
characteristics do not need to be extracted from the letter row,
and on numerical values of the work processed result, enabling
an even wider variety of information to be analyzed with high
precision.

Figs. 21 to 24 are diagrams explaining other examples of
display screens of the output section 406 of the document
processor according to the first embodiment. In Fig. 21, a
"cluster number" 2101 obtained by classification is displayed
in addition to "number", "date received", "sales office",
"vehicle type", "year", and "contents".

Moreover, in Fig. 21, the selector 403 has selected
"cluster number" 2101, and data relating to the "cluster number"
2101 is displayed in inverse video. When the "cluster number"
2101 is indicated using a key, the work processor 405 rearranges
the data.

Fig. 22 shows the result of the rearrangement. In Fig.
22, documents having a "cluster number" of "1" have been
collected and displayed. Thereafter, documents having a
"cluster number" of "2" are collected and displayed.

More specifically, the documents are rearranged in a
sequence of "numbers" "2", "11", "15", "23", "35", "54", "63",
"73", and "82", which have a "cluster number" of "1".
Thereafter, "numbers" "14", "18", "22", "27", "37", ..., which
have a "cluster number" of "2", are displayed.

Next, documents whose items in the "vehicle type" column
belong to "cluster number" of "1" are selected. In Fig. 23,
the items in the "vehicle type" column which belong to "cluster
number" of "1" have been selected, and the selected region 2301
is displayed in inverse video. In this way, since the documents
have already been rearranged according to their "cluster
number", and documents belonging to the same cluster have been
gathered and displayed, they can easily be selected as a
continuous region on the screen.

Next, Fig. 24 shows a bar graph of the frequency of
occurrence of the separate vehicle types in the selected region
2301. In Fig. 24, the bar graph display region 2401 displays
the nine selected documents whose "cluster number" is "1",
selected in the selected region 2301. These nine documents
are displayed in the bar graph according to their vehicle type.
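
The rearrangement by cluster number followed by counting of
vehicle types within one cluster may be sketched as follows.
This is an illustrative sketch: the record layout (a dictionary
per document keyed by item name), the function names, and the
sample records are hypothetical and are not the data of Figs. 21
to 24.

```python
from collections import Counter


def rearrange_by(records, key):
    """Sort records by one item; the sort is stable, so the original
    order is kept among records with the same value."""
    return sorted(records, key=lambda record: record[key])


def count_item(records, cluster, item="vehicle type"):
    """Count the values of one item among documents of one cluster."""
    return Counter(
        record[item] for record in records if record["cluster number"] == cluster
    )


# Hypothetical sample records.
records = [
    {"number": 14, "vehicle type": "ABC1800", "cluster number": 2},
    {"number": 2, "vehicle type": "ABC1600", "cluster number": 1},
    {"number": 11, "vehicle type": "ABC2000", "cluster number": 1},
    {"number": 15, "vehicle type": "ABC1600", "cluster number": 1},
]

ordered = rearrange_by(records, "cluster number")
counts = count_item(ordered, cluster=1)
print([record["number"] for record in ordered])  # [2, 11, 15, 14]
print(counts["ABC1600"], counts["ABC2000"])      # 2 1
```

The counts per vehicle type are the values that a bar graph of
the selected region would plot.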

In this way, the documents to be work processed can be
flexibly and easily selected, and various kinds of processes
can be carried out thereto. Furthermore, the processed result
can be processed again in the next processing, enabling
information to be analyzed at high precision.

Here, the characteristics of the letter rows which have
been classified or the like are extracted, and are processed
in a variety of ways after work processing using the
characteristics. However, a variety of processing may
alternatively be performed in advance.

For example, it is possible to select the item "vehicle
type", rearrange the documents using this as a key, and classify
the collected vehicle types according to, for example,
"ABC1600". Furthermore, when a document input by the input
section 401 contains errors such as misspellings, it is possible
to retrieve the letter row and replace the errors prior to
extracting the characteristics of the classified letter row and
carrying out work processing using these characteristics,
thereby adjusting the data to obtain a more accurate result.

Fig. 25 is a block diagram showing a detailed constitution
of the document memory 402 of the document processor according
to the first embodiment. In Fig. 25, the document memory 402
comprises a set value memory 2501, and a set value transceiver
2502. The set value memory 2501 comprises memories, starting
with a classification data memory 2503, for storing information
relating to various set values, that is, set values needed for
operations of the document processor. Consequently, the
information relating to the set values can be stored together
with the document information.

Furthermore, the set value transceiver 2502 transmits
information relating to the set values stored in the set value
memory 2501 to other information processors. Furthermore, the
set value transceiver 2502 receives the information relating
to the set values from other information processors.
Information relating to set values is received by the set value
transceiver 2502, and is stored in the set value memory 2501.

Stored information relating to set values is read out
when the document is subsequently read again,
and is stored in the set value memory 2501. The operator can
refer to the information relating to the set values by a
predetermined operation, and it can be reused in subsequent
processing. Consequently, the information relating to set
values can be saved and managed together with the documents,
thereby preventing loss of the information relating to the set
value, and enabling appropriate set values to be reused later.

Figs. 26 to 28 are diagrams explaining other examples of
screen displays of the output section 406 of the document
processor according to the first embodiment. In Fig. 26,
firstly, the operator selects the "contents" to be classified
on the display screen. Consequently, the selected region 2601
is displayed in inverse video. Next, when the classification
button 2602 is selected from a menu bar 2603, a question screen
2604 appears asking the number of classifications required,
that is, the number of documents to be classified.

When the operator enters the number of classifications 
into the question screen 2604, information relating to the 
number of classifications is stored in the document memory 402. 
In Fig. 26, "50" is input as the number of classifications. 

Thereafter, when the operator completes the analysis of 
the information, and presses a save button (not shown in the 
diagram) which pops up on the screen after selecting the file 
button 2605 of the menu bar 2603, the document memory 402 stores 
the information of the document together with the 
classification result after appending a file name specified by 
the operator. 

In Fig. 27, when the mouse pointer 2702 is moved to a column
2701 displaying the classification result, and the mouse button
is pressed, a classification information display screen 2703
displays the information relating to classification used in the
classification, and information relating to the classification
set value. As a consequence, relevance of the set value used
can be easily understood.

The information relating to classification, displayed on
the classification information display screen 2703, for example
comprises "classification date" showing information relating
to the time and date on which classification was carried out,
"number of documents" showing information relating to the
number of documents that were classified, etc. Furthermore,
the information relating to the classification set value
comprises information such as "classification number" showing
the number of classified documents, and "classification speech
part" showing which part of speech the classification was based
on.

A new table is created for each classification. Fig. 28
shows a second classification result displayed after
classification has been carried out a second time after
obtaining the first classification result. When the operator
wishes to display the first classification result again, he or
she moves the mouse pointer to the selection region 2801 on the
label at the bottom left of the screen, and presses the mouse
button. As a consequence, the first classification result is
displayed again. Thereafter, the second classification result
can be displayed again by performing the same operation.

Furthermore, in Fig. 28, information relating to the set
value used in the classifications is displayed in a
predetermined display region 2802 of the table. The display
region 2802 does not conceal the classification result display,
and the position of the display can be moved. Consequently,
the relationship between the classification result and the set
value can easily be understood.

Next, a sequence of document processing of the document
processor according to the first embodiment will be explained.
Fig. 29 is a flowchart showing a document processing sequence
of the document processor according to the first embodiment.

In the flowchart of Fig. 29, when starting the process,
it is determined whether the document data has been input to
the document processor (Step S2901). Here, the document
processor waits for the document data to be input, and when the
document data has been input (YES in Step S2901), the input
document data is stored (Step S2902). The Steps S2901 and S2902
may be carried out independently of other steps each time
document data is input.

Next, it is determined whether all or part of the stored
document data has been selected (Step S2903). Here, the
document processor waits for all or part of the document data
to be selected, and when document data has been selected (YES
in Step S2903), data relating to letter row characteristics of
all or part of the stored document data is extracted (Step
S2904).

Thereafter, predetermined work processing, such as
classification, is carried out based on the data relating to
the letter row characteristics extracted in the Step S2904 (Step
S2905). Following this, data which were work-processed in the
Step S2905 are output in a table format or the like (Step S2906).

Moreover, the data which were work-processed in the Step
S2905 are stored in correspondence with the original document
data (Step S2907). Furthermore, data relating to contents of
the work processing such as the set value of the work processing
are simultaneously stored (Step S2908).

Thereafter, it is determined whether all or part of the
data processed in the Step S2905 has been selected (Step S2909).
When the data has been selected (YES in the Step S2909), the
sequence shifts to the Step S2904, and thereafter, the processes
from the Step S2904 to S2909 are repeated. On the other hand,
when it is determined that all or part of the data processed
in the Step S2905 has not been selected (NO in the Step S2909),
all processing ends.

The document processing explained in the first embodiment
can be realized using a program prepared in advance on a computer,
such as a personal computer or a work station. This program
is recorded on a computer-readable recording medium such as a
hard disk, a floppy disk, a CD-ROM, an MO, or a DVD, and is
executed by reading out the program from the recording medium
using the computer. Furthermore, the program can be
distributed via the recording medium, or by using a network such
as the Internet as a transmission medium.

Next, an information classification device according to
second to sixth embodiments will be explained. In the second
to sixth embodiments described below, multiple classifications
are carried out while varying parameters (number of clusters
and document clusters to be classified, standards of similarity,
stop words, etc.) for document classification, extraction, and
positioning of a topic (content) from one cluster of documents,
based on the same interpretation as above, namely that a
document cluster includes a great amount of noise. By providing
means for saving and integrating the results, it is possible
to gradually determine what kind of contents are contained in
a given document cluster.

Since the information processing system comprising the
document classification device according to the second
embodiment of the present invention is the same as the first
embodiment shown in Fig. 1, further explanation will be omitted.
Furthermore, since the hardware constitution of the server 101
and the clients 102 is the same as the first embodiment shown
in Figs. 2 and 3, in order to avoid repetition, their explanation
will be omitted.

Next, the functional constitution of a document
classification device according to the second embodiment will
be explained. Fig. 30 is a block diagram showing a functional
constitution of the document classification device according
to the second embodiment.

As shown in the block diagram of Fig. 30, the document
classification device comprises an input section 3001, a
language analyzer 3002, a vector creator 3003, a classifier 3004,
a classification parameter specifier 3005, a classification
result memory 3006, a cluster characteristics display 3007, a
cluster characteristics calculator 3008, a classification
category memory 3009, a cluster selection specifier 3010, and
a classification category viewing operator 3011.

The input section 3001, the language analyzer 3002, the
vector creator 3003, the classifier 3004, the classification
parameter specifier 3005, the classification result memory 3006,
the cluster characteristics display 3007, the cluster
characteristics calculator 3008, the classification category
memory 3009, the cluster selection specifier 3010, and the
classification category viewing operator 3011 are controlled
by command processing of a CPU 201, a CPU 301, and the like,
in compliance with commands written in programs recorded in
recording media such as a ROM 202, a ROM 302, a RAM 203, a RAM
303, or a disk device 306, and a hard disk 316.

Here, the input section 3001 inputs document data, and
for example comprises an I/F 204, or an I/F 309, or the like,
capable of obtaining documents and groups of documents via
keyboards 209 or 311, a scanner 313 comprising an OCR function,
and a network 103.

Furthermore, in addition to the above, the input section
3001 may comprise any part capable of extracting document
data. For example, when the document data is saved in a
database, and the medium in which the database is stored is
provided in the document processor of the first embodiment,
the document data is input from that medium.

Furthermore, the language analyzer 3002 obtains
language-analyzed information by analyzing document data input
by the input section 3001. The vector creator 3003 creates a
document characteristics vector for the document data, based
on the language-analyzed information obtained from the language
analyzer 3002.

Furthermore, the classifier 3004 classifies documents 
based on the degree of similarity between document 
characteristic vectors created by the vector creator 3003, and 
creates clusters of documents. The classification parameter 
specifier 3005 specifies classification parameters, and for 
example comprises the I/F 204 or 309, or the like, capable of 
obtaining documents and groups of documents via the keyboards 
209 or 311, the mouses 210 or 312, or the network 103. 

Furthermore, the classification result memory 3006 
stores the classification result obtained by the classifier 
3004, that is, information relating to clusters of classified 
documents. Furthermore, the cluster characteristics display 
3007 displays cluster characteristics calculated by the cluster 
characteristics calculator 3008. 

The cluster characteristics calculator 3008 calculates
cluster characteristics, which are characteristics of document
clusters created by the classifier 3004. Furthermore, the
classification category memory 3009 stores the cluster
characteristics, calculated by the cluster characteristics
calculator 3008, as constitution elements of classification
categories. Furthermore, the classification category memory
3009 stores clusters of documents, selected by the cluster
selection specifier 3010, as constitution elements of
classification categories. That is, it stores all or some of
the documents belonging to clusters selected by the cluster
selection specifier 3010 as constitution elements of
classification categories.

The cluster selection specifier 3010 selects desired
clusters from among the multiple cluster characteristics
displayed by the cluster characteristics display 3007.
Furthermore, the cluster selection specifier 3010 selects
desired clusters of documents from among the clusters of
documents created by the classifier 3004. Furthermore, the
classification category viewing operator 3011 controls viewing
of data stored in the classification category memory 3009.

Next, there will be explained an appropriate example in 
which it is important to extract a topic (contents) contained 
in a document cluster, by imagining an analysis of free 
responses collected through a questionnaire or the like. 

In recent years, it has become possible to collect
thousands to tens of thousands of free responses in a short
period of time via the Internet or the like. Using this function,
a large amount of textual information can be gathered.

As an example of a large amount of textual information
collected through a questionnaire or the like, consider documents
containing written answers given in response to the question:
"Please give an example of wasteful office networking". Here, a
document cluster is a cluster of single responses.

Here, the operator (the questionnaire analyzer) may want
to know a summary of the opinions expressed, that is, what type
of opinions (topics) are contained in the cluster of opinions
(document cluster). To fulfil this requirement, the topic is
extracted by gathering together (classifying) similar opinions,
so as to extract information relating to the kind of opinions
that are contained in the result of the questionnaire.

Document classification typically comprises the
following three clearly divided steps. In the first Step, the
language analyzer 3002 extracts words (or specific continuous
rows of letters) contained in each of the documents (opinions)
input by the input section 3001. At this time, for example,
a language analysis algorithm such as format element analysis
is used.

In the second Step, a "word" x "document" matrix is
created using the extracted words as rows, the documents as
lines, and the word incidence as components. In addition to
word extraction using language analysis tools having a format
element analysis function and a syntax analysis function, other
information such as part-of-speech information, phrases, and
syntax information, can be obtained simultaneously, and can be
considered when creating the above "word" x "document" matrix.

Based on the "word" x "document" matrix, the vector
creator 3003 expresses the documents as vectors in
multidimensional space comprising words. This is accomplished
by one of the following methods, all of which are implemented
in the embodiments of the present invention.

(1) use the row elements of the matrix directly;

(2) append values representing the importance of the
documents after considering the length of the documents (number
of letters, number of pages, etc.) and the incidence of the words
in all the classified clusters;

(3) calculate an inner product matrix between documents
from the above matrix, and apply specific value analysis (for
example, by using factor analysis or main element analysis,
third-type quantified logic, and the like), to form dormant
meaningful space.
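
The second Step and methods (1) and (2) can be sketched roughly as follows. This is only an illustration under stated assumptions: whitespace splitting stands in for the language analyzer 3002, each document is held as one row of word incidences, and method (2) is approximated by dividing each row by the document's word count, which is just one possible length weighting. The names `build_matrix` and `length_normalize` are hypothetical and do not appear in the specification.

```python
from collections import Counter


def build_matrix(documents):
    """Return (vocabulary, rows); rows[i][j] is the incidence of word j in document i."""
    # Assumption: lowercased whitespace tokens stand in for real language analysis.
    counts = [Counter(doc.lower().split()) for doc in documents]
    vocabulary = sorted(set().union(*counts)) if counts else []
    rows = [[c[word] for word in vocabulary] for c in counts]
    return vocabulary, rows


def length_normalize(rows):
    """Divide each document row by its total word count (an assumed weighting)."""
    normalized = []
    for row in rows:
        total = sum(row)
        normalized.append([value / total if total else 0.0 for value in row])
    return normalized


docs = [
    "network cost is high",
    "cost of network maintenance is high",
    "too many passwords",
]
vocab, rows = build_matrix(docs)   # method (1): use the rows directly
vectors = length_normalize(rows)   # method (2), in a simplified form
```

Method (3) is not shown here, since it requires an eigenvalue-type decomposition of the inner product matrix.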

Furthermore, it is also possible to use the method
described in "Representing documents using an explicit model
of their similarities" (Authors: Brian T. Bartell, Garrison W.
Cottrell, and Richard K. Belew; Paper Title: Journal of the
American Society for Information Science; Academic Body: The
American Society for Information Science; Pages: 254-271, Vol.
46 No. 4; Year of Publication: 1995), wherein the method for
converting to dormant meaningful space is generalized, and
joint reference information and the like, created from
reference information of the document for other documents, is
appended to the inner product matrix between documents, and this
matrix is used to lead out expression space conversion
coefficients for projecting documents and words to space
reflecting their similarities.

In the third Step, the classifier 3004 classifies the
documents using the degree of similarity of the document
characteristic vectors. More specifically, the documents are
classified by a method such as square contingency,
discriminatory analysis, or clustering.

Furthermore, the degree of similarity may be measured by
the inner product, the cosine, the Euclidean distance, the
Mahalanobis distance, or the like. Any of these methods can
be used in the present embodiment.
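
Three of the measures named above can be written compactly as follows. This is a minimal sketch assuming document characteristic vectors are plain lists of numbers of equal length; the function names are illustrative. The Mahalanobis distance is omitted because it additionally needs a covariance matrix estimated from the document collection.

```python
import math


def inner_product(u, v):
    """Sum of element-wise products; larger means more similar."""
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Inner product normalized by vector lengths; 0.0 if either vector is all zeros."""
    norm = math.sqrt(inner_product(u, u)) * math.sqrt(inner_product(v, v))
    return inner_product(u, v) / norm if norm else 0.0


def euclidean_distance(u, v):
    """Straight-line distance; smaller means more similar."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
```

Note that the inner product and the cosine grow with similarity, whereas the Euclidean distance shrinks, so a classifier has to treat the two kinds of measure in opposite directions.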

Furthermore, there are many conventionally known
clustering algorithms. Clustering is generally divided into
layered clustering and non-layered clustering, but either can
be used in the present embodiment.

Furthermore, the classification parameter specifier 3005
specifies classification parameters to enable the classifier
3004 to classify the document characteristic vectors. The
classifier 3004 classifies the document characteristic vectors
it is saving, in compliance with classification parameters
specified by the classification parameter specifier 3005.
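
How a classification parameter can steer the classifier may be illustrated with a small sketch. The specification does not fix a particular clustering algorithm, so the example below substitutes a simple greedy non-layered scheme in which the parameter is assumed to be a cosine threshold; `classify` and its arguments are hypothetical names chosen for illustration.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def classify(vectors, threshold):
    """Greedy clustering: a document joins the first cluster whose seed is similar enough.

    Returns a list of clusters, each a list of document indices.
    """
    clusters = []  # document indices per cluster
    seeds = []     # the first vector of each cluster acts as its representative
    for index, vector in enumerate(vectors):
        for cluster, seed in zip(clusters, seeds):
            if cosine(vector, seed) >= threshold:
                cluster.append(index)
                break
        else:
            # No existing cluster was similar enough, so start a new one.
            clusters.append([index])
            seeds.append(vector)
    return clusters
```

Raising the threshold yields more, smaller clusters; lowering it merges documents more aggressively. This is the kind of adjustment an operator might make when re-running a classification.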

Thus, when the first document classification, comprising
the processes of the first to third Steps, has ended, the
classification result memory 3006 stores the classification
result.

Following this, the cluster characteristics calculator
3008 calculates characteristics showing what kind of clusters
have been obtained in the classification result, that is, it
calculates cluster characteristics. Typically, it identifies
the documents, or some of the documents, belonging to each
cluster, and sorts the documents based on their degree of
similarity with the center of the cluster.

In addition, values such as the word with the highest
incidence, the number of documents belonging to the cluster,
and the level of variation of documents within the cluster
(for example, the standard deviation within the cluster) are
calculated to represent cluster characteristics.
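
The kinds of cluster characteristics described above can be sketched as follows. The example assumes each document is a row of word incidences aligned with a vocabulary list, takes the cluster center to be the mean of the member rows, and uses the cosine as the degree of similarity; the function name and the dictionary keys are illustrative only.

```python
import math


def cluster_characteristics(vocabulary, rows, members, top_n=3):
    """Summarize one non-empty cluster given document rows and member indices."""
    size = len(members)
    dims = len(vocabulary)
    # Assumption: the cluster center is the mean of the member vectors.
    center = [sum(rows[i][j] for i in members) / size for j in range(dims)]

    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norm if norm else 0.0

    # Members sorted by decreasing similarity to the cluster center.
    ranked = sorted(members, key=lambda i: cosine(rows[i], center), reverse=True)

    # Words with the highest total incidence inside the cluster.
    totals = [sum(rows[i][j] for i in members) for j in range(dims)]
    order = sorted(range(dims), key=lambda j: -totals[j])
    top_words = [vocabulary[j] for j in order[:top_n] if totals[j] > 0]

    return {
        "number_of_members": size,
        "center": center,
        "members_by_similarity": ranked,
        "words_of_high_incidence": top_words,
    }
```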

The cluster information is calculated in order to inform
the operator what kinds of clusters (i.e. what kind of
characteristics they possess) have been output (displayed), and
as long as the cluster information shows cluster
characteristics to the operator, other types of contents
(characteristics) than the above may be used.

Furthermore, in addition to displaying cluster
characteristics as above, the cluster characteristics
calculator 3008 also calculates information representing the
relationship between clusters. In the case of layered
clustering, the upper or lower cluster is calculated, and in
the case of non-layered clustering, adjacent clusters are
calculated based on their degree of similarity to the cluster
center.

Next, the display produced by the cluster characteristics
display 3007 and cluster selection will be
explained. Fig. 31 is a diagram explaining an example of a
display of the cluster characteristics display 3007 of the
document classification device according to the second
embodiment.

In Fig. 31, each cluster comprises items such as a
"cluster ID" column 3101, a "number of members" column 3102,
a "words of high incidence" column 3103, a "document contents"
column 3104, and a "degree of similarity to center" column 3105,
thereby enabling the operator to operate the display in units.

The "cluster ID" column 3101 displays serial numbers
showing the cluster IDs. The "number of members" column 3102
displays the calculated number of documents, or some of the
documents, belonging to the cluster. The words having the
highest incidence in these documents are extracted and
displayed in the "words of high incidence" column 3103. The
contents of the documents are displayed in the "document
contents" column 3104, and the degree of similarity to the
center is expressed in numerical form and displayed in the
"degree of similarity to center" column 3105. This makes it
easier for the operator to understand the information.



The operator can detect the characteristics of the
clusters based on the information (amount of characteristics)
displayed. Here, when there is one cluster whose contents
(characteristics) can be understood, it can be selected by the
cluster selection specifier 3010.

More specifically, by moving the cursor 3110 to a
predetermined position of the displayed cluster, for example
to the "cluster ID" column 3101 using the mouse 210 or 312 or
the like, and clicking on that position, the entire cluster of
that cluster ID can be selected. It is acceptable to select
some, rather than all, of the documents belonging to the
selected cluster.

In Fig. 31, the "cluster ID" column 3101 has been clicked,
whereby the entire cluster is displayed in inverse video, and
the cluster (cluster ID "1") is selected.

Furthermore, when there is no cluster with comprehensible
contents, the operator resets the classification parameters
using the classification parameter specifier 3005, and executes
another classification.

Data relating to the cluster ID selected by the cluster
selection specifier 3010 is transmitted to the classification
category memory 3009. The classification category memory 3009
retrieves and stores the above amount of characteristics from
the cluster characteristics calculator 3008, based on the data
relating to the cluster ID.

Similarly, the classification category memory 3009
retrieves and stores the classification result from the
classification result memory 3006. Moreover, the
classification category memory 3009 can simultaneously store
information representing comments (e.g. "network maintenance
cost is high") about clusters input by the operator. Storing
information created by the operator as constituent elements of
the classification category in this way increases the
utilizable value of the classification category.

When an interface for other viewing operations is
provided, data stored in the classification category memory
3009 can be structured and categorized manually, or
automatically by using the degree of similarity of the stored
clusters to the cluster center, while viewing contents of
selected and stored clusters, and pinpointing meaningful
connections therebetween.

Next, a processing sequence of the document
classification device according to the second embodiment will
be explained. Fig. 32 is a flowchart showing a processing
sequence of the document classification device according to the
second embodiment. In the flowchart of Fig. 32, firstly, the
document to be classified is input (Step S3201).

Next, the language of the input document is analyzed (Step
S3202), and a document characteristic vector is created based on
the result of the analysis, that is, based on the extracted words
(Step S3203).

Thereafter, the process waits for a classification
parameter to be specified, and when a classification parameter
has been specified (YES in Step S3204), the document is
classified in compliance with the specified classification
parameter (Step S3205), and the result, that is, information
relating to the clusters, is stored (Step S3206).

Next, the characteristics of the classified clusters are
calculated (Step S3207), and the calculated results are
displayed (Step S3208). It is determined whether any of the
displayed clusters has been selected (Step S3209), and if not
(NO in the Step S3209), processing shifts to the Step S3204 and
waits once more for a classification parameter to be specified
(Step S3204).

On the other hand, when it is determined in the Step S3209
that a cluster has been selected (YES in the Step S3209), a
classification category for the selected cluster is created and
stored (Step S3210). At this time, information relating to
clusters input by the operator can also be stored. Here, the
processing series ends.
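
The control flow of Steps S3204 to S3210 can be sketched as a loop. This is an illustration only: operator interaction is modeled by two assumed inputs, an iterable of parameters tried in order and a `select` callback that returns the index of the chosen cluster or `None`. The calculation and display of cluster characteristics (Steps S3207 and S3208) are folded into that callback. All names are hypothetical.

```python
def run_session(vectors, parameters, classify, select):
    """Retry classification with new parameters until the operator selects a cluster.

    parameters: iterable of classification parameters (stands in for operator input)
    classify:   callable (vectors, parameter) -> list of clusters
    select:     callable (clusters) -> index of the chosen cluster, or None
    """
    for parameter in parameters:                 # S3204: a parameter is specified
        clusters = classify(vectors, parameter)  # S3205, result kept as in S3206
        chosen = select(clusters)                # S3207 to S3209 in one callback
        if chosen is not None:
            # S3210: the selected cluster becomes a classification category.
            return {"parameter": parameter, "members": clusters[chosen]}
    # The operator stopped supplying parameters without selecting a cluster.
    return None
```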

As described above, according to the document
classification device of the second embodiment, an expression
space conversion coefficient, for converting the documents to
expression space capable of projecting the meaningful
connections between the documents, is calculated based on the
degree of similarity between documents in document clusters to
be classified, and the documents are classified in the
expression space. Consequently, the documents can be
classified in a manner that reflects the intentions of the
operator.

Therefore, clusters can be obtained from the classifier
3004, and in addition, the clusters can be structured and
categorized based on their contents by the cluster
characteristics calculator 3008 and the classification
category memory 3009, using the degree of similarity of the
clusters to the cluster center and the like.

Furthermore, it is possible to structure and categorize
clusters closer to the intentions of the operator by using only
the clusters selected by the cluster selection specifier 3010.

In addition to the second embodiment described above, a
vector memory and a vector corrector may be added to the
constitution as in the third embodiment described below.

Since the information processing system comprising the
document classification device according to the third
embodiment of the present invention is the same as the first
embodiment shown in Fig. 1, further explanation will be omitted.
Furthermore, since the hardware constitutions of the server 101
and the clients 102 are the same as the first embodiment shown
in Figs. 2 and 3, explanation thereof will be omitted.

Next, the functional constitution of a document
classification device according to the third embodiment will
be explained. Fig. 33 is a block diagram showing a functional
constitution of the document classification device according
to the third embodiment. In Fig. 33, like members to those in
Fig. 30 of the second embodiment are represented by like
reference symbols, and explanation thereof is omitted.

In the block diagram of Fig. 33, the document
classification device comprises an input section 3001, a
language analyzer 3002, a vector creator 3003, a classifier 3004,
a classification parameter specifier 3005, a classification
result memory 3006, a cluster characteristics display 3007, a
cluster characteristics calculator 3008, a classification
category memory 3009, a cluster selection specifier 3010, a
classification category viewing operator 3011, a vector memory
3301, and a vector corrector 3302.

The vector memory 3301 stores document characteristic
vectors created by the vector creator 3003. Furthermore, the
vector corrector 3302 corrects document characteristic vectors,
stored in the vector memory 3301, by
deleting document characteristic vectors of documents
belonging to the portion of clusters selected by the cluster
selection specifier 3010.

Furthermore, the classifier 3004 classifies the
documents based on the document characteristic vectors
corrected by the vector corrector 3302.



The vector memory 3301 and the vector corrector 3302 are
controlled in accordance with commands from the CPU 201 and 301,
and the like, in compliance with commands written in programs
recorded in recording media such as a ROM 202 and 302, a RAM
203 and 303, or a disk device 306, and a hard disk 316.

The document characteristic vectors (row vectors) and
word (word characteristics) vectors (line vectors) are created
in the vector creator 3003, and stored in the vector memory 3301.
This is in order to secure the document characteristic vectors
to be used in subsequent classifications.

The vector corrector 3302 deletes all or some of the
documents belonging to the clusters selected by the cluster
selection specifier 3010, so that these documents are also
deleted from subsequent classifications. The deleted document
characteristic vectors are stored in the vector memory 3301.

As a result, of the vector data being stored in the vector
memory 3301, the data to be used in subsequent classifications
are those remaining after the row vectors of the documents (or a
part thereof, as specified by the operator) belonging to the
selected clusters have been deleted.
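
The correction performed by the vector corrector 3302 amounts to removing rows before the next classification. A minimal sketch follows, assuming the vector memory is modeled as a dictionary from document ID to row vector; the function and parameter names are hypothetical.

```python
def correct_vectors(doc_vectors, selected_members):
    """Return the document vectors that remain after removing the selected documents.

    doc_vectors:      dict mapping document ID -> row vector
                      (an assumed stand-in for the vector memory 3301)
    selected_members: IDs of the documents to delete (all or part of the
                      clusters chosen by the operator)
    """
    removed = set(selected_members)
    return {
        doc_id: vector
        for doc_id, vector in doc_vectors.items()
        if doc_id not in removed
    }
```

The original dictionary is left untouched, so the full set of vectors stays available if the operator wants to start over.
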

Next, a processing sequence of the document
classification device according to the third embodiment will
be explained. Fig. 34 is a flowchart showing a processing
sequence of the document classification device according to the
third embodiment. In the flowchart of Fig. 34, firstly, the
document to be classified is input (Step S3401).

Next, the language of the input document is analyzed (Step
S3402), a document characteristic vector is created based on
the result of the analysis, that is, based on the extracted words
(Step S3403), and the created document characteristic vectors
are stored (Step S3404).

Thereafter, the process waits for a classification
parameter to be specified, and when a classification parameter
has been specified (YES in Step S3405), the document is
classified in compliance with the specified classification
parameter (Step S3406), and the result, that is, information
relating to the clusters, is stored (Step S3407).

Next, the characteristics of the classified clusters are
calculated (Step S3408), and the calculated results are
displayed (Step S3409). It is determined whether any of the
displayed clusters has been selected (Step S3410), and if not
(NO in the Step S3410), the processing shifts to the Step S3405
and waits once more for a classification parameter to be
specified (Step S3405).

On the other hand, when it is determined in the Step S3410
that a cluster has been selected (YES in the Step S3410), a
classification category for the selected cluster is created and
stored (Step S3411). At this time, information relating to
clusters input by the operator can also be stored. Thereafter,
it is determined whether a repeat of the processing has been
specified (Step S3412).

In the Step S3412, when a repeat of the processing has
been specified (YES in Step S3412), all or some of the documents
belonging to the selected clusters are deleted by document
characteristic vector correction (Step S3413). Thereafter,
the processing shifts to the Step S3405, and all the processes
from the Steps S3405 to S3413 are repeated.

On the other hand, in the Step S3412, when a repeat of
the processing has not been specified (NO in the Step S3412),
the processing series ends.

As described above, according to the document
classification device of the third embodiment, the vector
memory 3301 and the vector corrector 3302 make it possible to
create new clusters from which the effects of clusters which
are already known are removed.

In the third embodiment described above, a vector memory
and a vector corrector are added to the constitution, but a
document expression space corrector may be added instead of the
vector corrector, as in a fourth embodiment described below.

Since the information processing system comprising the
document classification device according to the fourth
embodiment of the present invention is the same as the first
embodiment shown in Fig. 1, further explanation will be omitted.
Furthermore, since the hardware constitutions of the server 101
and the clients 102 are the same as the first embodiment shown
in Figs. 2 and 3, in order to avoid repetition, their explanation
will be omitted.

Next, the functional constitution of a document
classification device according to the fourth embodiment will
be explained. Fig. 35 is a block diagram showing a functional
constitution of the document classification device according
to the fourth embodiment. In Fig. 35, like members to those
in Fig. 30 of the second embodiment are represented by like
reference symbols, and explanation thereof is omitted.

In the block diagram of Fig. 35, the document
classification device comprises an input section 3001, a
language analyzer 3002, a vector creator 3003, a classifier 3004,
a classification parameter specifier 3005, a classification
result memory 3006, a cluster characteristics display 3007, a
cluster characteristics calculator 3008, a classification
category memory 3009, a cluster selection specifier 3010, a
classification category viewing operator 3011, a vector memory
3501, and a document expression space corrector 3502.

The vector memory 3501 stores document characteristic
vectors created by the vector creator 3003. Furthermore, the
document expression space corrector 3502 corrects the document
expression space, used when determining the degree of
similarity between document characteristic vectors stored in
the vector memory 3501, based on an
amount of characteristics calculated from the portion of
clusters selected by the cluster selection specifier 3010.

Furthermore, the classifier 3004 classifies the
documents using the document expression space corrected by the
document expression space corrector 3502, based on the degree
of similarity between the document characteristic vectors
created by the vector creator 3003.

The vector memory 3501 and the document expression space
corrector 3502 are controlled in accordance with commands from
the CPU 201 and 301, and the like, in compliance with commands
written in programs recorded in recording media such as a ROM
202 and 302, a RAM 203 and 303, or a disk device 306, and a hard
disk 316.

Next, the contents of the document expression space
corrector 3502 will be explained. In the vector corrector 3302
in the third embodiment, document characteristic vectors were
deleted to eliminate the effects of clusters that were already
known, but the multidimensional space in which the document
characteristic vectors are expressed was not altered.

Therefore, when format characteristics of clusters
selected by the operator in the previous classification are to
be eliminated from the next classification, the space in which
the document characteristic vectors are expressed must itself
be altered.

The document expression space corrector 3502 is provided
for this purpose, and corrects the document expression space.
Here, an example will be explained in which the characteristic
dimensions of the document expression space are altered by
deleting the characteristic dimensions having a high degree of
similarity with the center of a cluster selected by the operator.

Since the center of a cluster selected by the operator
can be expressed as a vector, the degree of similarity between
this cluster center vector and the characteristic dimensions
of the document expression space stored in the vector memory
3501 is calculated, so as to identify the characteristic
dimensions with a high degree of similarity.

The cosine, inner product, the Euclidean distance, the
Mahalanobis distance, or the like, is used to measure the degree
of similarity. Furthermore, characteristic dimensions with a
high degree of similarity can be identified by threshold value
processing, in which characteristic dimensions with a degree
of similarity exceeding a certain degree of similarity are
deleted, or fixed-number processing, in which a fixed number
of characteristic dimensions with a high degree of similarity
are deleted. Furthermore, discriminatory analysis or the like
can be performed.

The document expression space corrector 3502 deletes the
characteristic dimensions after calculating those which are to
be deleted. Deletion is carried out by deleting the line
vectors of characteristic dimensions identified from a matrix
of "characteristic dimensions (words)" x "documents" stored in
the vector memory 3501. The document vectors corrected by the
document expression space corrector 3502 are stored in the vector
memory 3501 to be used in subsequent classifications.
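
The correction of the document expression space can be sketched as follows, covering both threshold value processing and fixed-number processing. The example rests on several assumptions that are not stated in the specification: each document is held as a row with one entry per word, so deleting a characteristic dimension removes one entry from every row; and the similarity between the cluster center and a dimension is taken to be the cosine between the center vector and that dimension's unit axis vector. All function names are illustrative.

```python
import math


def dimension_similarities(center):
    """Cosine between the cluster center and each axis (unit vector) of the space."""
    norm = math.sqrt(sum(c * c for c in center))
    return [c / norm if norm else 0.0 for c in center]


def dimensions_to_delete(center, threshold=None, fixed_number=None):
    """Pick dimension indices by threshold value or by fixed number."""
    sims = dimension_similarities(center)
    if threshold is not None:
        # Threshold value processing: similarity exceeds a given value.
        return sorted(j for j, s in enumerate(sims) if s > threshold)
    # Fixed-number processing: the N most similar dimensions.
    ranked = sorted(range(len(sims)), key=lambda j: -sims[j])
    return sorted(ranked[: fixed_number or 0])


def correct_expression_space(vocabulary, rows, delete):
    """Drop the given dimensions from the vocabulary and from every document row."""
    drop = set(delete)
    kept_vocabulary = [w for j, w in enumerate(vocabulary) if j not in drop]
    kept_rows = [[v for j, v in enumerate(row) if j not in drop] for row in rows]
    return kept_vocabulary, kept_rows
```

With this definition of similarity, a dimension is deleted when the selected cluster's center points strongly along it, which is one way of removing the vocabulary that characterizes an already known topic.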

Next, a processing sequence of the document
classification device according to the fourth embodiment will
be explained. Fig. 36 is a flowchart showing a processing
sequence of the document classification device according to the
fourth embodiment. In the flowchart of Fig. 36, firstly, the
document to be classified is input (Step S3601).

Next, the language of the input document is analyzed (Step
S3602), a document characteristic vector is created based on
the result of the analysis, that is, based on the extracted words
(Step S3603), and the created document characteristic vectors
are stored (Step S3604).

Thereafter, the process waits for a classification
parameter to be specified, and when a classification parameter
has been specified (YES in Step S3605), the document is
classified in compliance with the specified classification
parameter (Step S3606), and the result, that is, information
relating to the clusters, is stored (Step S3607).

Next, the characteristics of the classified clusters are
calculated (Step S3608), and the calculated results are
displayed (Step S3609). It is determined whether any of the
displayed clusters has been selected (Step S3610), and if not
(NO in the Step S3610), the processing shifts to the Step S3605
and waits once more for a classification parameter to be
specified (Step S3605).

On the other hand, when it is determined in the Step S3610
that a cluster has been selected (YES in the Step S3610), a
classification category for the selected cluster is created and
stored (Step S3611). At this time, information relating to
clusters input by the operator can also be stored. Thereafter,
it is determined whether a repeat of the processing has been
specified (Step S3612).

In the Step S3612, when a repeat of the processing has
been specified (YES in Step S3612), the document expression
space is corrected by deleting the line vectors of the
characteristic dimensions identified from the matrix
"characteristic dimensions (words)" x "documents" (Step S3613).
Thereafter, the processing shifts to the Step S3605, and all
the processes from the Steps S3605 to S3613 are repeated.

On the other hand, in the Step S3612, when a repeat of
the processing has not been specified (NO in the Step S3612),
the processing series ends.

As described above, according to the document
classification device of the fourth embodiment,
format characteristics of a cluster selected by the operator
in a previous classification can be deleted from subsequent
classifications by the document expression space corrector 3502,
enabling a new cluster to be created in the deleted state.

In the third and fourth embodiments described above,
either a vector corrector or a document expression space
corrector is added to the constitution, but both the vector
corrector and the document expression space corrector may be
added, as in a fifth embodiment
described below.

Since the information processing system comprising the
document classification device according to the fifth
embodiment of the present invention is the same as the first
embodiment shown in Fig. 1, further explanation will be omitted.
Furthermore, since the hardware constitutions of the server 101
and the clients 102 are the same as the first embodiment shown
in Figs. 2 and 3, in order to avoid repetition, their explanation
will be omitted.

Next, the functional constitution of a document
classification device according to the fifth embodiment will
be explained. Fig. 37 is a block diagram showing a functional
constitution of the document classification device according
to the fifth embodiment. In Fig. 37, like members to those in
Fig. 30 of the second embodiment are represented by like
reference symbols, and explanation thereof is omitted.

In the block diagram of Fig. 37, the document
classification device comprises an input section 3001, a
language analyzer 3002, a vector creator 3003, a classifier 3004,
a classification parameter specifier 3005, a classification
result memory 3006, a cluster characteristics display 3007, a
cluster characteristics calculator 3008, a classification
category memory 3009, a cluster selection specifier 3010, a
classification category viewing operator 3011, a vector memory
3701, a vector corrector 3702, and a document expression space
corrector 3703.

The vector memory 3701 stores document characteristic
vectors created by the vector creator 3003. Furthermore, the
vector corrector 3702 corrects the document characteristic
vectors, stored in the vector memory
3701, by deleting document characteristic vectors of documents
belonging to the portion of clusters created by the classifier
3004.

Furthermore, the document expression space corrector
3703 corrects the document expression space, used when
determining the degree of similarity between document
characteristic vectors stored in the vector memory
3701, based on the characteristics of clusters
selected by the cluster selection specifier 3010.

Furthermore, the classifier 3004 classifies the
documents based on the degree of similarity between document
characteristic vectors corrected by the vector corrector 3702,
using the document expression space corrected by the document
expression space corrector 3703.

The vector memory 3701, the vector corrector 3702, and
the document expression space corrector 3703 are controlled in
accordance with commands from the CPU 201 and 301, and the like,
in compliance with commands written in programs recorded in
recording media such as a ROM 202 and 302, a RAM 203 and 303,
or a disk device 306, and a hard disk 316.

Next, the contents of the vector corrector 3702 and the
document expression space corrector 3703 will be explained. In
the fourth embodiment, documents belonging to a selected
cluster are used in subsequent classifications.

In the fifth embodiment, since the vector corrector 3702
and the document expression space corrector 3703 are both
provided, documents belonging to selected clusters are deleted
from subsequent classifications, and are not classified in
subsequent classifications.

In the fourth embodiment, the aspect of topic extraction
is emphasized, and it is assumed that a given document can be
classified under multiple topics. For example, in an
investigation into networking, the following answer is given:
"The end user enquires about how to install the software, and
so cannot work as a system manager.". This can be classified
under the topic of "difficulties relating to understanding the
software operation", but can also be classified under the topic
of "busy nature of system manager work".

The fourth embodiment addresses the need to be able to
extract both the cluster "difficulties relating to
understanding the software operation" and the cluster "busy
nature of system manager work".

Conversely, since the operator already knows topics which
have been extracted once, there will be cases when he or she
desires a different result from the next classification. The
fifth embodiment addresses this requirement by providing the
vector corrector 3702, thereby ensuring that all or part of
documents belonging to clusters selected in the nth
classification are deleted from subsequent classifications.

Documents belonging to clusters which have been specified
for selection by the cluster selection specifier 3010 are stored
in row vector format in the vector memory 3701. Therefore,
document clusters for subsequent classification are created by
deleting these row vectors using the vector corrector 3702.

Moreover, as in the fourth embodiment, in accordance with
the selected clusters, the document expression space corrector
3703 deletes the characteristic dimension from the matrix
stored in the vector memory 3701.
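
As a rough illustration only, and not the implementation of the device itself, the two corrections can be sketched in Python on a small matrix whose rows are documents and whose columns are characteristic dimensions (words). The function names, the list-of-lists layout, and the sample values are assumptions made for this sketch.

```python
# Hypothetical sketch: rows are documents, columns are characteristic dimensions (words).

def correct_vectors(matrix, selected_doc_rows):
    """Assumed vector corrector behaviour: drop the row vectors of selected documents."""
    drop = set(selected_doc_rows)
    return [row for i, row in enumerate(matrix) if i not in drop]


def correct_expression_space(matrix, selected_dimensions):
    """Assumed expression space corrector behaviour: drop the selected word columns."""
    drop = set(selected_dimensions)
    return [[v for j, v in enumerate(row) if j not in drop] for row in matrix]


# Three documents described by four word dimensions (illustrative counts).
matrix = [
    [2, 0, 1, 0],
    [0, 3, 0, 1],
    [1, 1, 0, 2],
]

# Suppose document 0 belongs to the selected cluster and dimension 2 characterises it.
remaining = correct_expression_space(correct_vectors(matrix, [0]), [2])
print(remaining)
```

The remaining matrix is what the next classification would operate on in this sketch.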

Next, a processing sequence of the document
classification device according to the fifth embodiment will
be explained. Fig. 38 is a flowchart showing a processing
sequence of the document classification device according to the
fifth embodiment. In the flowchart of Fig. 38, firstly, the
document to be classified is input (Step S3801).

Next, the language of the input document is analyzed (Step
S3802), a document characteristic vector is created based on
the result of the analysis, that is, based on the extracted words
(Step S3803), and the created document characteristic vector
is stored (Step S3804).

Thereafter, the process waits for a classification
parameter to be specified, and when a classification parameter
has been specified (YES in Step S3805), the document is
classified in compliance with the specified classification
parameter (Step S3806), and the result, that is, information
relating to the clusters, is stored (Step S3807).

Next, the characteristics of the classified clusters are
calculated (Step S3808), and the calculated results are
displayed (Step S3809). It is determined whether any of the
displayed clusters has been selected (Step S3810), and if not
(NO in the Step S3810), the processing shifts to the Step S3805
and waits once more for a classification parameter to be
specified (Step S3805).

On the other hand, when it is determined in the Step S3810
that a cluster has been selected (YES in the Step S3810), a
classification category for the selected cluster is created and
stored (Step S3811). At this time, information relating to
clusters input by the operator can also be stored. Thereafter,
it is determined whether a repeat of the processing has been
specified (Step S3812).

In the Step S3812, when a repeat of the processing has
been specified (YES in the Step S3812), all or some of the
documents belonging to the selected clusters are deleted by
document characteristic vector correction (Step S3813).

Following the Step S3813, the document expression space
is corrected by deleting, from the matrix "characteristic
dimensions (words)" x "document", the row vectors of the
identified characteristic dimensions (Step S3814). Thereafter,
the processing shifts to the Step S3805, and all the processes
from the Steps S3805 to S3814 are repeated.

On the other hand, in the Step S3812, when a repeat of
the processing has not been specified (NO in the Step S3812),
the processing series ends.

As described above, according to the document
classification device of the fifth embodiment, the vector
corrector 3702 eliminates the effects of clusters which are
already known, and in addition, the document expression space
corrector 3703 eliminates the format characteristics of a
cluster selected by the operator in a previous classification
from subsequent classifications, thereby enabling a new cluster
to be created from the data remaining after the deletion.

In the second and fourth embodiments described above,
when classification was repeatedly carried out, no
consideration was given to information relating to how many
times a document was selected. However, when the constitution
comprises a selection information appender, as in the sixth
embodiment described below, selection information can be
displayed together with cluster characteristics.

Since the information processing system comprising the
document classification device according to the sixth
embodiment of the present invention is the same as that of the
first embodiment shown in Fig. 1, further explanation will be
omitted. Furthermore, since the hardware constitutions of the
server 101 and the clients 102 are the same as those of the first
embodiment shown in Figs. 2 and 3, their explanation will be
omitted in order to avoid repetition.

Next, the functional constitution of a document
classification device according to the sixth embodiment will
be explained. Fig. 39 is a block diagram showing a functional
constitution of the document classification device according
to the sixth embodiment. In Fig. 39, like members to those in
Fig. 35 of the fourth embodiment are represented by like
reference symbols, and explanation thereof is omitted.

In the block diagram of Fig. 39, the document
classification device comprises an input section 3001, a
language analyzer 3002, a vector creator 3003, a classifier 3004,
a classification parameter specifier 3005, a classification
result memory 3006, a cluster characteristics display 3007, a
cluster characteristics calculator 3008, a classification
category memory 3009, a cluster selection specifier 3010, a
classification category viewing operator 3011, a vector memory
3501, a document expression space corrector 3502, and a
selection information appender 3901.

When all or some of the documents belonging to a cluster
created by the classifier 3004 have been selected,
the selection information appender 3901 appends selection
information showing that the documents have been selected.
Furthermore, the cluster characteristics display 3007 displays
the cluster characteristics together with the selection
information appended by the selection information appender 3901.

The selection information appender 3901 is controlled
in accordance with commands from the CPU 201 and 301, and the
like, in compliance with commands written in programs recorded
in recording media such as a ROM 202 and 302, a RAM 203 and 303,
or a disk device 306, and a hard disk 316.

Next, the detailed contents of the selection information
appender 3901 will be explained. In a questionnaire,
experience has taught that unique and highly opinionated
answers are extremely important. This is because many answers
could not have been anticipated by the person who planned the
questionnaire.

Accordingly, in a case where documents belonging to a
cluster selected by the operator are used in subsequent
classifications, it is possible to improve the ability to
identify documents used on multiple occasions, and also the
ability to identify documents which have not been selected at
all, by showing how many times the documents have been selected
when the cluster characteristics display 3007 displays the
individual documents.

Fig. 40 is a diagram explaining a table 4000 provided in
the classification result memory 3006 of the document
classification device according to the sixth embodiment. In
Fig. 40, table contents are listed for each document ID, and
the table 4000 shows in which cycle each document was selected
by the operator during classification. That is, when a document
has been selected, selection information of "1" is entered, and
when a document has not been selected, selection information
of "0" is entered.

For example, when classification has been carried out four
times, the table 4000 shows that document ID "1" was selected by
the operator in the first and second classifications, but was not
selected in the third and fourth classifications. On the other
hand, document ID "2" has not been selected even once, indicating
that it is an opinion unknown to the operator.
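
A table of this kind can be pictured with the following minimal Python sketch. The dictionary layout and the helper names are assumptions made for illustration; only the "1"/"0" flags and the two example documents come from the description above.

```python
# Hypothetical layout for table 4000: document ID -> one flag per classification cycle.
selection_table = {
    1: [1, 1, 0, 0],  # selected in the first and second classifications only
    2: [0, 0, 0, 0],  # never selected so far
}


def selection_count(table, doc_id):
    """Number of classifications in which the document was selected."""
    return sum(table[doc_id])


def never_selected(table):
    """Document IDs that have not been selected in any classification."""
    return [doc_id for doc_id, flags in table.items() if not any(flags)]


print(selection_count(selection_table, 1))
print(never_selected(selection_table))
```

The count could then drive the display changes (letter color, background density, and so on) described in the following paragraphs.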

Based on such information, when the cluster 
characteristics display 3007 displays the documents to the 
operator, the display may for example be altered in accordance 
with the number of times the documents have been selected. For 
example, visual characteristics such as the color of the letters, 
the density of the background, and the color intensity may 
conceivably be altered. 

Furthermore, the number of selections can be directly
displayed by numerical symbols, graphs, or the like. In any
case, as long as it is possible to visually identify selected
documents and unselected documents, the constitution is not
limited to that described above.

Furthermore, the selection information may be viewed 
using the classification category viewing operator 3011. 

Next, the processing performed by the selection
information appender 3901 will be explained. Fig. 41 is a
flowchart showing a processing sequence of the selection
information appender 3901 of the document classification device
according to the sixth embodiment. In the flowchart of Fig.
41, firstly, classification is carried out (Step S4101), and
then, the first document is extracted (Step S4102).

It is determined whether the extracted document has been
selected for classification in the Step S4101 (Step S4103).
Here, when the document has been selected (YES in the Step S4103),
data "1" is stored as the selection information (Step S4104).
On the other hand, when the document has not been selected (NO
in the Step S4103), data "0" is stored as the selection
information (Step S4105).

Next, it is determined whether or not the processing of
the documents has ended (Step S4106). Here, when not all of the
documents have been processed (NO in the Step S4106), the
next document is extracted (Step S4107), the processing shifts
to the Step S4103, and the Steps S4103 to S4107 are repeated.

On the other hand, in the Step S4106, when all the
documents have been processed (YES in the Step S4106), the
processing shifts to the Step S4101, and classification is
performed again (Step S4101). In this way, the number of times
that the processing between the Steps S4101 to S4107 is repeated
is equal to the number of classifications.
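
The loop of Steps S4102 to S4107 might be sketched as follows. This is a simplified illustration under stated assumptions: the function name, the dictionary used as the table, and the example selections are hypothetical, and the classification of Step S4101 is represented only by the list of document IDs selected in each cycle.

```python
def append_selection_info(table, document_ids, selected_ids):
    """Record 1 or 0 for every document after one classification (Steps S4102 to S4107)."""
    selected = set(selected_ids)
    for doc_id in document_ids:                # S4102 / S4107: take the documents in turn
        flag = 1 if doc_id in selected else 0  # S4103 to S4105: selected or not
        table.setdefault(doc_id, []).append(flag)
    return table


table = {}
# Two classification cycles; the selections are made up for this example.
for selected in ([1, 3], [1]):
    append_selection_info(table, [1, 2, 3], selected)

print(table)
```

Each pass of the outer loop adds one column of flags, so the length of every list equals the number of classifications carried out.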

As explained above, according to the sixth embodiment,
the selection information appender 3901 appends selection
information, which is displayed by the cluster characteristics
display 3007, and consequently, it is possible to improve the
ability to identify documents used on multiple occasions, and
also the ability to identify documents which have not been
selected at all.

The document classification method described in the
second to fifth embodiments is realized by running a
predetermined program on a computer, such as a personal computer
or a work station. The program is recorded on a
computer-readable recording medium such as a hard disk, a floppy
disk, a CD-ROM, an MO, or a DVD, and is executed by reading out the
program from the recording medium using the computer.
Furthermore, the program can be distributed via the recording
medium, or by using a network such as the Internet as a
transmission medium.

Next, an information classification device according to
the seventh to sixteenth embodiments will be explained. In the
present embodiments of the present invention, one or more
collections of sentences written in a natural language which
is/are to be classified will be termed a document. By way of
a more specific example, patent laid-open publications
classified by IPC classification, or newspaper articles
classified into specific fields such as politics, economics,
culture, science and technology, and the like, are documents.
When claims and specific sentences are extracted therefrom,
these are regarded either as sentences under the classification
of "claims", or, in the case of specific sentences which can
be classified according to intended usage, these are regarded
as documents. There follows a detailed description of the
seventh to sixteenth embodiments of the present invention based
on the drawings.

Fig. 42 is a block diagram showing a constitution of a
document classification device according to the seventh
embodiment of the present invention. As shown in Fig. 42, the
document classification device of the seventh embodiment
comprises a document input section (document input means) 5001
for inputting document data groups, a document divider
(document dividing means) 5002 for dividing document data into
one or multiple divided document data based on a predetermined
reference, a document-divided document map creator
(document-divided document map creation means) 5003 for
creating a map showing the correspondence between the document
data and the divided document data, a divided document
classifier (divided document classifying means) 5004 for
classifying the divided document data, that is, the divided
document, a divided document classification result creator
(divided document classification result creation means) 5005
for creating divided document classification result
information, a document classification result creator
(document classification result creation means) 5006 for
creating classification result information of the above
document data using the document-divided document map and the
divided document classification result information, etc.

The document divider 5002, the document-divided document
map creator 5003, the divided document classifier 5004, the
divided document classification result creator 5005, and the
document classification result creator 5006 have a shared or
independent memory for storing programs and a CPU, which
operates in compliance with the programs.

Next, the document classification device and the document
classification method of the seventh embodiment will be
explained in detail in accordance with Fig. 42 and the like.
Firstly, the document input section 5001 inputs a group of
documents. The document input section 5001 comprises a
keyboard, an OCR device, a detachable recording medium, or
network communications means, and the documents are input via
any one of these.

Then, the document divider 5002 extracts the document data,
divides them based on a predetermined reference, and creates
one or multiple divided document data from one document data.
The document data is divided using a method specified by the
user, such as using information relating to the structure of
the documents, or information relating to the constituents of
the documents. The method used will not be considered here.

Fig. 43 shows an example of creating multiple divided
document data from document data using the document
classification device and document classification method of the
present invention. In this example, a document 1 comprises
multiple news topics, and each one-minute topic forms one
document unit. As shown in Fig. 43, the news topics are
separated by two line-break codes. The document 1, comprising
one document, is divided using this stipulation to create seven
divided document data of divided documents 1-1 to 1-7, each
comprising a separate topic. It is also possible to include
the document 1 in its state prior to division in the data, but
this is not done here.
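
The stipulation "news topics are separated by two line-break codes" can be illustrated with a short Python sketch. The function name and the sample text are hypothetical, and the line-break code is assumed to be the newline character.

```python
def divide_document(text, separator="\n\n"):
    """Split one document into divided documents wherever two line breaks occur."""
    parts = [part.strip() for part in text.split(separator)]
    return [part for part in parts if part]  # ignore empty fragments


# A made-up document with three topics; the first topic spans two lines.
document_1 = "Topic A first line.\nTopic A second line.\n\nTopic B.\n\nTopic C."
divided = divide_document(document_1)

for number, part in enumerate(divided, start=1):
    print(f"1-{number}: {part!r}")
```

A single line break inside a topic does not trigger a division, so multi-line topics stay together in one divided document.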

When the document has been divided, the document-divided
document map creator 5003 creates a map showing the document
data prior to division in correspondence with the divided
document data created from the document data. For example, the
document-divided document map creator 5003 creates a map
comprising identifiers uniquely representing individual
document data, and identifiers uniquely representing
individual divided document data, or a map comprising
identifiers uniquely representing divided document data for
each document data. The method for arranging the document data
and divided document data in mutual correspondence will not be
considered here.

Fig. 44 shows an example of creating a document-divided
document map. In Fig. 44, the documents 1 to 3 represent
document data, and the divided documents 1 to 12 represent
divided document data. As shown in the diagram, identification
numbers (identifiers) for uniquely identifying the document
data and the divided document data are appended. Then, as shown
in the bottom left portion of Fig. 44, the identification
numbers of the document data and the identification numbers of
the divided document data are listed in mutual correspondence
in table format. When multiple divided document data can be
regarded as identical with regard to the reference used for the
document classification, identical identification numbers may
be appended thereto.
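
One possible form of such a map is a list of (document ID, divided document ID) pairs, sketched below in Python. The function name is hypothetical, and the way the twelve divided documents are distributed over the three documents is an assumption for this example, not the allocation shown in Fig. 44.

```python
def build_map(divided_counts):
    """Return (document_id, divided_document_id) pairs with sequential divided IDs."""
    pairs = []
    next_id = 1
    for doc_id, count in divided_counts.items():
        for _ in range(count):
            pairs.append((doc_id, next_id))
            next_id += 1
    return pairs


# Assumed split of twelve divided documents across three documents.
doc_divided_map = build_map({1: 5, 2: 4, 3: 3})

# Reverse lookup: which document does a divided document come from?
parent_of = {divided_id: doc_id for doc_id, divided_id in doc_divided_map}
print(parent_of[6])
```

The reverse lookup is the piece that later allows a classification result for a divided document to be traced back to its source document.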

Thereafter, the divided document classifier 5004
classifies the divided documents. The divided documents can
be classified by, for example, language-analyzing the
individual divided documents, counting the incidence of words
contained therein, determining characteristics vectors
quantitatively showing the characteristics of the documents
based on the result of the language analysis, and then using
a method such as square contingency, discriminant analysis,
or cluster analysis.
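
The word-counting step can be sketched as follows. The regular-expression tokenizer is only a stand-in for the language analysis (real documents, particularly Japanese ones, would need a proper morphological analyzer), and the function name and sample sentences are assumptions.

```python
from collections import Counter
import re


def characteristic_vectors(divided_documents):
    """Build word-count vectors over the shared vocabulary of all divided documents."""
    counts = [Counter(re.findall(r"[a-z]+", text.lower())) for text in divided_documents]
    vocabulary = sorted(set().union(*counts))  # one dimension per distinct word
    vectors = [[c[word] for word in vocabulary] for c in counts]
    return vocabulary, vectors


vocabulary, vectors = characteristic_vectors(["Rain and wind", "Wind, wind and sun"])
print(vocabulary)
print(vectors)
```

The resulting vectors could then be handed to a clustering or discriminant method of the kind named above; that step is not shown here.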

Next, the divided document classification result creator
5005 creates divided document classification result
information based on the result of the divided document
classification (see Fig. 45). Here, the divided document
classification result information comprises, for example, (a)
information relating to categories to which the divided data
belong (e.g. information of the items "classification category"
and "representative value and distance of categories to which
the documents belong" in the table of "Results of classifying
divided document data into three categories" shown in Fig. 45),
(b) information relating to individually created categories
(e.g. information of the items "representative value" and
"number of data belonging to category (number of divided
document)" in the table of "Information Relating to
Classification Categories" shown in Fig. 45), (c) information
between created categories (e.g. information in the table of
"Distance between Classification Categories" in Table 4), and
(d) the like. The user can also use the various information
mentioned above as basic data for analyzing the classification
result.

Fig. 45 shows an example of creating a classification
result in a case where twelve divided document data are
classified into three categories using their quantitative
characteristics vectors. The quantitative three-dimensional
vectors of the divided document data (the number of components
of the vector is the number of all the types of words originating
in the classified document cluster, but here, the vectors are
linearly converted to three-dimensional vectors in which
several words have been deleted) can be classified into three
categories by utilizing a cluster analysis method such as, for
example, Ward's method.

That is, each of the divided document data belongs to one
of the three categories shown in the diagram. The
representative value of each category to which the divided
document data belong is an average value of the characteristics
vector of the divided document data which belong to the category
(the center of the divided document data which belong to the
category).
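
The representative value described here is a component-wise mean. A minimal sketch, with a hypothetical function name and illustrative vectors that are not the values of Fig. 45:

```python
def representative_value(vectors):
    """Component-wise mean of the characteristics vectors in one category."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


# Illustrative three-dimensional vectors of three divided documents in one category.
category = [[3.0, 2.0, 4.0], [2.0, 2.0, 3.0], [3.0, 2.0, 4.0]]
center = representative_value(category)
print([round(value, 2) for value in center])
```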

Furthermore, the distance (corresponding to the degree
of similarity) to the representative value of the category to
which the data belongs can be determined (for example, in the
case of the divided document 3 in Fig. 45) using the value of
divided document 3 in the divided document data characteristics
vector item, and the value of the item of the representative
value (center of the divided document category) of the category
2, which is the classification category for the divided document
3, in the following equation.

$$\left((3.00-2.66)^{2} + (2.00-2.00)^{2} + (4.00-3.66)^{2}\right)^{1/2} = 0.48$$



Hence, the smaller the distance to the representative 
value of the category to which the divided document belongs, 
the higher the degree of similarity with the average divided 
document belonging to that category. 
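
The distance calculation for the divided document 3 can be reproduced with a few lines of Python. The function name is an assumption; the two vectors are the values quoted in the text for the divided document 3 and the representative value of the category 2.

```python
import math


def distance(vector, representative):
    """Euclidean distance between a characteristics vector and a category center."""
    return math.sqrt(sum((v - r) ** 2 for v, r in zip(vector, representative)))


d = distance([3.00, 2.00, 4.00], [2.66, 2.00, 3.66])
print(round(d, 2))  # prints 0.48
```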

In addition to the statistics shown in Fig. 45, various
statistics can be created, such as dispersion within or between
categories, the range of the degree of similarity in each
category, etc.

Then, the document classification result creator 5006
uses the document-divided document map and the divided document
classification result information to create classification
result information of the document data, such as that shown in
Fig. 46, for example. As shown in the example of Fig. 46, for
each category, classification result information is created,
such as the divided document data belonging to the category, the
degree of similarity thereof (distance to the representative
value of the category to which the data belongs), the
pre-division document data to which the divided document data
belongs (document to which data belongs), the area occupied by
the document (the share of the category occupied by the divided
document data), the relative position of the divided document
data in the document (order), and the degree of similarity
ranking of the divided document data within the category to
which it belongs.

In the above example, the document to which data belongs is
obtained from the document-divided document map, and the other
classification result information is obtained from the divided
document classification result information. In addition to
the information shown in Fig. 46, the document classification
result creator 5006 can use various statistics, such as the
dispersion of the data within categories, and the deviation
value of the divided document data within the category to which
it belongs, as well as the contents of the document data and the
divided document data, and the like, as the classification
result information.
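
How the map and the divided document classification results might be combined is sketched below. All names and sample values are hypothetical, and the "share" computed here (one divided by the number of divided documents in the category) is only one assumed reading of "the share of the category occupied by the divided document data".

```python
def document_results(doc_divided_map, divided_results):
    """Attach the source document to each divided-document result and rank by distance."""
    parent_of = {divided_id: doc_id for doc_id, divided_id in doc_divided_map}
    by_category = {}
    for divided_id, (category, dist) in divided_results.items():
        by_category.setdefault(category, []).append(
            {"divided": divided_id, "document": parent_of[divided_id], "distance": dist}
        )
    for rows in by_category.values():
        rows.sort(key=lambda row: row["distance"])  # smaller distance = more similar
        for rank, row in enumerate(rows, start=1):
            row["rank"] = rank
            row["share"] = 1 / len(rows)            # assumed definition of the share
    return by_category


doc_map = [(1, 1), (1, 2), (2, 3)]                            # (document, divided document)
results = {1: ("A", 0.5), 2: ("B", 0.2), 3: ("A", 0.1)}       # divided -> (category, distance)
table = document_results(doc_map, results)
print(table["A"])
```

Only the "document" field comes from the map; everything else is derived from the divided document classification results, which mirrors the division of sources described above.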

Furthermore, in the example described above, all the
results are expressed in table format in units of divided
document data, but the results can also be expressed in units of
classification categories or document data. Furthermore, the
classification result information need not only be expressed
in text format, but can also be expressed graphically, making
it more comprehensible to the user.

Thus, according to the present invention, one document
is divided, the divided document is classified, and the
relationship between the document prior to division and the
divided document is displayed to the user. Furthermore, the
classification result of the divided document is displayed to
the user. Therefore, when one document contains multiple
topics and meanings, the document is not classified into
categories limited to specific topics and meanings, or
classified into categories different from those desired by the
user, making the classification categories more easily
comprehensible to the user. Furthermore, since the position
of the divided document in the document prior to division (the
document to which the divided document belongs) is displayed,
the user can efficiently read the part of the document cluster
that he or she wants to read.

Fig. 47 is a block diagram showing a constitution of the
document classification device according to an eighth
embodiment of the present invention. As shown in Fig. 47, in
addition to the constitution shown in the seventh embodiment
of Fig. 42, the document classification device according to the
eighth embodiment further comprises (a) a document saving section
(document saving means) 5007 for saving document data, (b) a
divided document saving section (divided document saving means)
5008 for saving divided document data, and (c) a
document-divided document map saving section (document-divided
document map saving means) 5009 for saving a document-divided
document map created by the document-divided document map
creator 5003. The saving sections comprise, for example, shared
hard disks, semiconductor memories, or the like.

With the constitution described above, the document
saving section 5007 of the present embodiment saves information
accompanying the document, such as the contents of the document,
the author of the document, the date of authorship, and the date
of last correction, in an appropriate format. Furthermore,
when the document has a quantitative characteristics vector
comprising elements of the document, in addition to the document
contents, these are also saved in the document saving section
5007. When identifiers uniquely expressing the individual
document data are appended in the document input section 5001,
the document saving section 5007 also saves these identifiers
in an appropriate format.

Furthermore, the divided document saving section 5008 saves
the contents of the divided document data created by the document
divider 5002 in an appropriate format, and in addition, saves
quantitative characteristics vectors. When identifiers
uniquely expressing the individual document data are appended,
the divided document saving section 5008 also saves the
identifiers in an appropriate format.

Furthermore, the document-divided document map saving
section 5009 saves document-divided document maps created by
the document-divided document map creator 5003 in an
appropriate format.

According to the eighth embodiment, since document data,
divided document data, and document-divided document maps are
saved in this way, for a single document data it is possible
to efficiently determine classification results having
different parameters such as the number of classifications, the
classification method, and the settings used in the
classifications, without recreating the divided document data
and the document-divided document map. Furthermore, by
classifying the document data and saving the data needed to
create the classification result, the user is free to take more
time over the classification, and to re-analyze previously
classified documents within a given period of time.

Fig. 48 is a block diagram showing a constitution of the
document classification device according to a ninth embodiment
of the present invention. As shown in Fig. 48, in addition to
the constitution shown in the eighth embodiment of Fig. 47, the
document classification device of the present embodiment
further comprises a divided document classification result
saving section (divided document classification result saving
means) 5010 for saving the divided document classification
results created by the divided document classification result
creator 5005. The divided document classification result
saving section 5010 comprises, for example, a shared hard disk,
a semiconductor memory, or the like.

Thus, according to the ninth embodiment, since document
data, divided document data, document-divided document maps,
and divided document classification results are saved, in
addition to the effects of the eighth embodiment, it is possible
to express the classification result of a single classification
in various formats, such as textual format, chart format and
graph format. Moreover, since the divided document
classification result information is saved, during
classifications and analysis of classification results, the
user is free to take more time over the operations, and can
re-analyze previously classified documents in a variety of
formats within any given period of time.

In the document classification device and document
classification method according to the tenth embodiment of the
present invention, as shown in Fig. 49, a document 1 comprising
document data prior to division is included among the multiple
divided document data created by the document divider 2. As
a consequence, in the present embodiment, the user is able to
obtain not only a detailed classification structure of document
data, but also a classification structure fused with a schematic
macro classification structure, obtained as a result of
classifying the document data itself prior to division.

In the document classification device and document
classification method according to the eleventh embodiment of
the present invention, the document divider 2 divides the
document data based on structural information relating to the
document data. Fig. 50 shows an example of classification
object document data described in HTML format.
Prior to division, structural information is extracted from
HTML-format document data such as that shown in Fig. 50, and
divided document data is created from document data by setting
an appropriate division stipulation for the documents using that
structure.

That is, taking the tag "LI" in the document data by way
of example, the stipulation for creating divided document
data is to "treat text having the tag "LI" as one divided
document data". By applying this stipulation to the document
data, the seven divided documents shown in Fig. 50 are created.
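
The "LI" stipulation can be sketched with the HTML parser in the Python standard library. The class name and the sample markup are hypothetical (the markup is not the content of Fig. 50), and nested lists are not handled in this simplified version.

```python
from html.parser import HTMLParser


class ListItemDivider(HTMLParser):
    """Collect the text of each LI element as one divided document."""

    def __init__(self):
        super().__init__()
        self.divided = []
        self._inside = False

    def handle_starttag(self, tag, attrs):
        if tag == "li":  # HTMLParser reports tag names in lower case
            self._inside = True
            self.divided.append("")

    def handle_endtag(self, tag):
        if tag == "li":
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.divided[-1] += data


html = "<H1>News Topic</H1><UL><LI>First topic</LI><LI>Second topic</LI></UL>"
divider = ListItemDivider()
divider.feed(html)
divider.close()
divided = [text.strip() for text in divider.divided]
print(divided)
```

Text outside the LI elements, such as the heading, is simply not collected, which corresponds to the remark below that not all of the document data has to be used.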

Even when the document does not have a specific structural
format such as HTML, XML, or SGML as described above, a
stipulation for division can be created from information
relating to the size of the letters, the decoration of the
letters, the color of the letters, the font, and the like,
enabling the document to be divided. Furthermore, when the
document data comprises an image, and is input by an OCR device
or the like, a stipulation for division can be created using
information relating to the original layout of the image, or
the like, enabling a divided document to be created.

It is not necessary to use all the document data for the
divided document data. For example, in the example shown in
Fig. 50, the letter row "News Topic (98/09/25)" is not used in
the divided document.

Thus, in the eleventh embodiment, structural information
is extracted from the document data, and the structural
information is used to set an appropriate stipulation for
division prior to dividing the document. As a result, different
topics are divided appropriately. Consequently, documents can
be classified in such a manner that the detailed classification
structure of the document data is known.

In the twelfth embodiment, the document classification
device and document classification method according to the
seventh to tenth embodiments of the present invention, as shown
in Fig. 51, further comprise (a) a document element analyzer
(document element extraction means) 5011 for extracting
elements such as words contained in the document data, and (b)
an extractor of information accompanying elements (information
accompanying elements extraction means) 5012 for extracting
information accompanying the elements such as the part of speech
accompanying the elements extracted by the document element
analyzer 5011 (Fig. 51 shows an example in which the document
element analyzer 5011 and the extractor of information
accompanying elements 5012 are additionally provided to the
ninth embodiment of Fig. 48). The document divider 5002 divides
the document data using the elements extracted by the document
element analyzer 5011, and the information accompanying the
elements extracted by the extractor of information accompanying
elements 5012.

As shown in Fig. 52, prior to division, the document
element analyzer 5011, comprising language analysis processing
means, extracts from the document data elements such as words,
and the extractor of information accompanying elements 5012
extracts information accompanying the elements such as the
parts of speech, and an appropriate stipulation for division
is set in accordance with the information. The document element
analyzer 5011 and the extractor of information accompanying
elements 5012 do not have to be newly provided, since similar
means in the divided document classifier 5004 can be used
instead.

In this embodiment, as shown for example in Fig. 52, the
document data comprises a group of multiple news topics having
no specific structural information. In this example, the
topics are listed after letter rows comprising the word "topic"
+ "number" + "return symbol". The above structure is identified
from the extraction results of the document element analyzer
5011 and the extractor of information accompanying elements
5012, and after considering the ends of the sentences, the
following division stipulation is created: "With the letter row
"topic + number + return symbol" as the header, deem a letter
row comprising the above letter row, and a letter row surrounded
by a document return symbol, to be one divided document data".
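
The division stipulation above can be approximated in a few lines. Note the substitution: the embodiment identifies the header from word and part-of-speech analysis, whereas this sketch swaps in a plain regular expression. The pattern HEADER and the function divide_by_topic_headers are hypothetical, and the sketch assumes each divided document runs from one header to the next header or to the end of the document.

```python
import re

# Assumed header form: the word "topic", a number, then a line break.
HEADER = re.compile(r"^topic[ \t]*\d+[ \t]*\n", re.IGNORECASE | re.MULTILINE)


def divide_by_topic_headers(text):
    """Split text at each header row; text before the first header is unused."""
    starts = [m.start() for m in HEADER.finditer(text)]   # stored header positions
    ends = starts[1:] + [len(text)]                       # next header or document end
    return [text[s:e].strip() for s, e in zip(starts, ends)]
```

Because the header must begin a line, a title row that merely contains the word "Topic" is not treated as a header and is left out of the divided documents.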

More specifically, firstly, only the parts of speech and
return symbols are extracted from the extracted words and
information about parts of speech and the like. Then, letter
rows "topic + number + return symbol" and document end symbols
are detected, and their positions in the document are stored.
Then, a division stipulation is applied to the document data,
creating divided document data such as that shown in Fig. 52.

It is not necessary to use all the document data for the
divided document data. For example, in the example shown in
Fig. 52, the letter row "News Topic (98/09/25)" is not used in
the divided document. Furthermore, in the above example,
elements and information accompanying the elements are extracted
from the document data in order to set a stipulation for division,
but it is acceptable to extract only the elements, and to set
a stipulation for division based only on the element
information.

Thus, according to the twelfth embodiment, element
information and the like is extracted from the document data,
and the extracted element information and the like is used to
set an appropriate stipulation for division. Consequently, as
in the eleventh embodiment, the document can be classified
in such a manner that the detailed classification structure of
the document data is known.

According to the thirteenth embodiment, in the document
classification device and document classification method
according to the seventh to the tenth embodiments, the document
divider 5002 divides data in accordance with a specification
range specified by the user. When the user specifies various
divided document ranges for document data such as that shown
in Fig. 53, the document divider 5002 divides the document in
compliance with the specifications.

In the present embodiment, when classifying a document,
the document divider 5002 firstly displays on the screen left
and right specification points, and a region specification
object comprising region specification lines, as the
initialized state in the upper part of the document. In this
state, by using a pointing device such as a mouse to drag the
left or right specification point and move it up and down, the
user can select regions of the divided document.

When making a specification, the document divider 5002
shows that a region is being selected by changing the color of
the specification pointer from dark to light, and changing the
region specification line from a solid line to a broken line.
To select a region, the user need only stop dragging the
specification point at a position of his or her own choice.

Next, the user decides whether or not to make the region
he or she has selected into a divided document. When he or she
decides not to do so, this decision is shown clearly by the
document divider 5002 casting a net over the selected region
on the screen.

In this way, according to the present embodiment, since
the user can select divided document data from document data
as he or she wishes, he or she can learn the detailed
classification structure of the document data. In addition,
the user can classify documents as he or she wishes.

According to the fourteenth embodiment, in the document
classification device and document classification method
according to the seventh to the tenth embodiments, document data
is divided based on the number of letters, the number of
sentences, or both the number of letters and the number of
sentences. For example, the document data shown in Fig. 54 is
divided into units of approximately two hundred letters.

Here, the units each comprise approximately two hundred
letters, since there is no guarantee that a unit of exactly two
hundred letters will end with a full stop. Therefore, the
nearest full stop before or after the two hundredth letter is
deemed to be the end of the divided document. In this way, the
divided document of Fig. 54 is created. Similarly, documents
can be divided into units comprising a predetermined number of
sentences, and documents can be divided based on both the number
of letters and the number of sentences.
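
The length-based division described above can be sketched as follows. This is an illustrative reading under stated assumptions, not the implementation of the embodiment: the function name divide_by_length and its parameters unit and stop are hypothetical, and a single character is assumed to serve as the full stop.

```python
def divide_by_length(text, unit=200, stop="."):
    """Split text into units of roughly `unit` letters, each ending at a full stop."""
    pieces = []
    start = 0
    while start < len(text):
        if len(text) - start <= unit:
            pieces.append(text[start:].strip())        # remainder is the last unit
            break
        target = start + unit - 1                      # index of the unit-th letter
        before = text.rfind(stop, start, target)       # last full stop before it
        after = text.find(stop, target)                # first full stop at or after it
        candidates = [p for p in (before, after) if p != -1]
        if not candidates:
            pieces.append(text[start:].strip())        # no full stop left
            break
        # The nearest full stop to the target letter ends the divided document.
        cut = min(candidates, key=lambda p: abs(p - target)) + 1
        pieces.append(text[start:cut].strip())
        start = cut
    return [p for p in pieces if p]
```

Each unit may therefore be somewhat shorter or longer than `unit` letters, which corresponds to the "approximately two hundred letters" wording of the text.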

Consequently, according to the fourteenth embodiment,
since documents can be divided based on the number of letters,
the number of sentences, or both the number of letters and the
number of sentences, there is an increased capability to
classify different documents having contents of different
topics and the like. Therefore, as above, documents can be
classified so that the detailed classification structure of the
document data can be known.

According to the fifteenth embodiment, in the document
classification device and document classification method
according to the previous embodiments, the document
classification result creator 5006 specifies only information
representing document data, and representative information
accompanying the document data, as classification result
information.

As shown for example in Fig. 55, the classification
categories are displayed at the head, key words representing
the categories are displayed next to the classification
categories, and, for example, the document data name (document
name) of the document data contained in the divided document
data belonging to the categories is displayed below the category
name, as information representing the document data.
Furthermore, document icons are displayed on the left of the
document data names. When these document icons are specified,
the contents of the document data are displayed.

Furthermore, document data names of divided document data
having a high degree of similarity to the category
representative value are arranged at the head (left side) of
the list of document data names. Furthermore, when multiple
divided document data created from the same document data belong
to the same classification category, only a document data name
corresponding to the divided document data having the highest
degree of similarity is displayed. The key words are words
which appear frequently.
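
The ordering and de-duplication rules above can be sketched as follows. This is a minimal illustration only; the function name category_listing and the tuple layout (category, document name, similarity) are assumptions introduced for the example, and the similarity values are taken as already computed.

```python
def category_listing(divided_docs):
    """Build the per-category list of document names.

    divided_docs: iterable of (category, document_name, similarity) tuples,
    one tuple per divided document.
    Returns {category: [document_name, ...]} ordered by similarity.
    """
    # Keep only the highest similarity for each (category, document) pair, so
    # a document whose parts fall in the same category is listed once.
    best = {}
    for category, name, similarity in divided_docs:
        key = (category, name)
        if key not in best or similarity > best[key]:
            best[key] = similarity

    # Names with the highest similarity come first in each category.
    listing = {}
    for (category, name), _ in sorted(best.items(), key=lambda kv: -kv[1]):
        listing.setdefault(category, []).append(name)
    return listing
```

A document can still appear under several categories when its divided documents were classified differently; only repeats within one category are collapsed.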

Thus, according to the fifteenth embodiment, since only
information representing document data, and representative
information accompanying the document data, are displayed as
the classification result information, the user can easily
comprehend the overall classification structure of the document
data in detail.

According to the sixteenth embodiment of the present
invention, in addition to specifying the document
classification result as in the fifteenth embodiment,
information representing divided document data and information
accompanying the divided document data are also displayed.

As shown for example in Fig. 56, the classification
categories are displayed at the head, key words representing
the categories are displayed next to the classification
categories, and, for example, the document data name (document
name) of the document data contained in the divided document
data belonging to the categories is displayed below the category
name, as information representing the document data.

Furthermore, document icons are displayed on the left of
the document data names. When the document icons are specified,
the contents of the document data are displayed. Moreover,
divided document icons are displayed on the right of the
document data names. The position of divided document data in
the document data, and the number of divided documents in the
document data, are displayed in the divided document icons. The
divided document data in the document data can be displayed by
specifying a divided document icon.

Furthermore, document data names of divided document data
having a high degree of similarity to the category
representative value are arranged at the head of the list of
document data names. Furthermore, when multiple divided
document data created from the same document data belong to the
same classification category, only a document data name
corresponding to the divided document data having the highest
degree of similarity is displayed.

Thus, according to the sixteenth embodiment, since only
information representing document data, representative
information accompanying the document data, information
representing divided document data, and representative
information accompanying the divided document data are
displayed as the classification result information, the user
can easily comprehend the overall classification structure of
the document data in detail, and can easily comprehend which
document data has been classified in which category, and the
like.

The document classification device and document
classification method of the present invention have been
explained above. Programs for executing the document
classification method can be recorded on a detachable and
computer-readable recording medium, and the document
classification according to the present invention can be
carried out by using the recording medium within the
above-mentioned data processing device.

As described above, according to one aspect of this
invention, the document processor of the present invention
comprises a document memory for storing input document data;
a selection unit for selecting all or part of the document data
stored in the document memory; a characteristics extraction
unit for extracting data relating to characteristics of letter
rows from all or part of the document data selected by the
selection unit; a work processing unit for work-processing all
or part of the document data based on the data relating to
characteristics of letter rows extracted by the characteristics
extraction unit; and an output unit for outputting all or part
of the document data work-processed by the work processing unit.
Consequently, when analyzing documents according to their
meanings, rather than merely outputting the result of the
analysis, the entire information analysis operation can be
supported.

Further, the output unit comprises an item value set unit
for setting a plurality of item values based on the contents
of all or part of the document data work-processed by the
work processing unit; and a totalization unit for totalizing
all or part of the document data for each item value set by the
item value set unit. Furthermore, the output unit outputs all
or part of the document data in the format of a table having
an item value as at least one axis. Consequently, the result
of the work-processing can easily be expressed in a cross table,
and the contents of the information can easily be grasped.
Therefore, when analyzing documents according to their meanings,
rather than merely outputting the result of the analysis, the
entire information analysis operation can be supported.
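
The totalization into a cross table described above can be illustrated with a short sketch. It assumes that each work-processed document has already been reduced to a record of item values; the function name cross_table and the example field names are hypothetical and do not come from the specification.

```python
from collections import Counter


def cross_table(records, row_key, col_key):
    """Count documents for each pair of item values.

    records: list of dicts, one per document, holding its item values.
    Returns (row_labels, column_labels, counts) with counts[i][j] being the
    number of documents having row_labels[i] and column_labels[j].
    """
    counts = Counter((r[row_key], r[col_key]) for r in records)
    rows = sorted({row for row, _ in counts})
    cols = sorted({col for _, col in counts})
    # A Counter returns 0 for pairs that never occurred.
    return rows, cols, [[counts[(row, col)] for col in cols] for row in rows]
```

For example, records carrying a "product" and a "kind" item value yield a table with products on one axis and kinds on the other, each cell holding the number of matching documents.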

Further, the output unit outputs all or part of the
document data work-processed by the work processing unit
together with all or part of the document data in its state prior
to work-processing by the work processing unit. Consequently,
data to be work-processed and other data can be displayed
simultaneously and identified, whereby the range of the
work-processing to be carried out can be accurately and easily
determined. Therefore, when analyzing documents according to
their meanings, rather than merely outputting the result of the
analysis, the entire information analysis operation can be
supported.

Further, the document memory also stores all or part of
the document data work-processed by the work processing unit.
Consequently, since other data can be handled simultaneously,
when thereafter analyzing documents according to their meanings,
rather than merely outputting the result of the analysis, the
entire information analysis operation can be supported.

Further, the selection unit further selects all or part
of the document data output by the output unit. Consequently,
since all or part of the document data output by the output unit
can be selected for analysis, a wide variety of information can
be analyzed with high precision. Therefore, when analyzing
documents according to their meanings, rather than merely
outputting the result of the analysis, the entire information
analysis operation can be supported.

Further, the document memory further stores data relating
to contents of the work processing. Consequently, not only can
loss of data relating to the contents of work-processing
be prevented and the data managed easily, but also the
relationship between settings used in the work-processing and
the processed result can be determined. Therefore, when
analyzing documents according to their meanings, rather than
merely outputting the result of the analysis, the entire
information analysis operation can be supported.

According to another aspect of this invention, the
document classification device according to the present
invention comprises an input unit for inputting document data;
a language analyzer unit for analyzing document data input by
the input unit and obtaining language analysis information; a
vector creation unit for creating document characteristic vectors
for the document data based on the language analysis information
obtained by the language analyzer unit; a classification unit
for classifying documents based on the degree of similarity
between document characteristic vectors created by the vector
creation unit, and creating clusters of documents; a cluster
characteristics calculation unit for calculating cluster
characteristics, which are characteristics of clusters of
documents created by the classification unit; and a
classification category memory for storing cluster
characteristics, calculated by the cluster characteristics
calculation unit, as constituent elements of classification
categories. Consequently, it is possible to obtain clusters,
and to structure and categorize the clusters based on their
contents using their degree of similarity to the cluster center,
and the like. Therefore, it is possible to gradually determine
what kind of contents are contained in a given document cluster.
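
The chain of units described above (vector creation, classification by similarity, cluster characteristics) can be sketched as follows. Everything here is an assumption made for illustration: whitespace tokenization stands in for the language analysis, term-frequency vectors with cosine similarity stand in for the document characteristic vectors and their degree of similarity, and a greedy single-pass grouping with a hypothetical threshold stands in for the classification unit. None of these choices is stated in the specification.

```python
import math
from collections import Counter


def vectorize(text):
    """Term-frequency vector; splitting on whitespace is a stand-in for language analysis."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def cluster(vectors, threshold=0.3):
    """Greedy single pass: join the most similar cluster if it reaches the threshold."""
    clusters = []   # each cluster: {"members": [indices], "centroid": Counter}
    for i, vec in enumerate(vectors):
        best = max(clusters, key=lambda c: cosine(vec, c["centroid"]), default=None)
        if best is not None and cosine(vec, best["centroid"]) >= threshold:
            best["members"].append(i)
            best["centroid"].update(vec)   # summed vector; cosine ignores its scale
        else:
            clusters.append({"members": [i], "centroid": Counter(vec)})
    return clusters


def cluster_characteristics(c, top=3):
    """The most frequent terms of a cluster, usable as elements of a category."""
    return [term for term, _ in c["centroid"].most_common(top)]
```

The terms returned by cluster_characteristics play the role of the cluster characteristics that would be stored in the classification category memory.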
According to another aspect of this invention, the
document classification device comprises an input unit for
inputting document data; a language analyzer unit for analyzing
document data input by the input unit and obtaining language
analysis information; a vector creation unit for creating
document characteristic vectors for the document data based on
the language analysis information obtained by the language
analyzer unit; a classification unit for classifying documents
based on the degree of similarity between document
characteristic vectors created by the vector creation unit, and
creating clusters of documents; a cluster characteristics
calculation unit for calculating cluster characteristics,
which are characteristics of clusters of documents created by
the classification unit; a display unit for displaying the
cluster characteristics calculated by the cluster
characteristics calculation unit; a cluster selection
specification unit for selecting predetermined clusters from
the clusters of documents created by the classification unit; and
a classification category memory for storing cluster
characteristics, calculated by the cluster characteristics
calculation unit, as constituent elements of classification
categories. Consequently, only selected clusters are used,
making it possible to structure and categorize the clusters in
a manner closer to that desired by the operator. Therefore,
it is possible to gradually determine what kind of contents are
contained in a given document cluster.

Further, the document classification device of the
present invention described above further comprises a document
characteristic vector memory for storing document
characteristic vectors created by the vector creation unit; and a
vector correction unit for correcting document characteristic
vectors stored in the document characteristic vector memory,
so that document characteristic vectors of documents belonging
to clusters selected by the cluster selection unit are deleted.
Furthermore, the classification unit classifies documents
based on the document characteristic vectors corrected by the
vector correction unit. Consequently, the effects of clusters
which are already known can be eliminated, and new clusters can
be created. Therefore, it is possible to gradually determine
what kind of contents are contained in a given document cluster.

Further, the document classification device of the
present invention described above further comprises a document
characteristic vector memory for storing document
characteristic vectors created by the vector creation unit; and a
document expression space correction unit for correcting the
document expression space used when determining the degree of
similarity between document characteristic vectors stored in
the document characteristic vector memory, based on a
characteristics amount calculated from clusters selected by the
cluster selection unit. Furthermore, the classification unit
classifies documents based on the degree of similarity between
document characteristic vectors created by the vector creation
unit, using the document expression space corrected by the
document expression space correction unit. Consequently,
cluster characteristics selected by the operator in the
previous classification can be eliminated from the next
classification, enabling new clusters to be created.
Therefore, it is possible to gradually determine what kind of
contents are contained in a given document cluster.

Further, the document classification device of the
present invention described above further comprises a document
characteristic vector memory for storing document
characteristic vectors created by the vector creation unit; and a
document expression space correction unit for correcting the
document expression space used when determining the degree of
similarity between document characteristic vectors stored in
the document characteristic vector memory, based on a
characteristics amount calculated from clusters selected by the
cluster selection unit. Furthermore, the classification unit
classifies documents based on the degree of similarity between
document characteristic vectors created by the vector creation
unit, using the document expression space corrected by the
document expression space correction unit. Consequently,
influences of the known cluster can be eliminated and cluster
characteristics selected by the operator in the previous
classification can be eliminated from the next classification,
enabling new clusters to be created. Therefore, it is possible
to gradually determine what kind of contents are contained in
a given document cluster.

Further, the document classification device of the
present invention described above further comprises a
selection information appending unit for appending selection
information showing the fact of selection when all or part of
the documents belonging to a cluster of documents created by
the classification unit have been selected. Furthermore, the
display unit displays the cluster characteristics, and also
displays the selection information appended by the selection
information appending unit. Consequently, it is possible to
improve the ability to identify documents used on multiple
occasions, and the ability to identify documents which have not
been selected at all. Therefore, it is possible to gradually
determine what kind of contents are contained in a given
document cluster.

Further, the classification category memory stores
cluster characteristics and/or information created by an
operator, in addition to all or part of the documents belonging
to a cluster of documents selected by the selection
specification unit, as constituent elements of classification
categories. Consequently, the contents of clusters can be
easily recognized, and in addition, the operator can easily
create his or her own classification categories, thereby
improving the usefulness of the classification categories.
Therefore, it is possible to gradually determine what kind of
contents are contained in a given document cluster.

According to still another aspect of this invention, the
document classification device for classifying document
clusters in accordance with contents thereof of the present
invention comprises a document input unit for inputting
document data groups; a document dividing unit for dividing
document data into one or multiple divided document data based
on a predetermined reference; a document-divided document map
creation unit for creating a map showing the correspondence
between the document data and the divided document data; a
divided document classification unit for classifying the
divided document data; a divided document classification result
creation unit for creating divided document classification
result information based on a classification result of the
divided document classification unit; and a document
classification result creation unit for creating
classification result information of the above document data
using the document-divided document map and the divided
document classification result information. Consequently,
when one document contains multiple topics and meanings, these
can be classified into categories according to specific topics
and meanings, so that the classifications do not differ from
categories desired by a user, thereby enabling the user to
easily comprehend the classification categories. Furthermore,
since the positions of the divided documents in documents prior
to division (documents belonging to the clusters) are displayed,
the user is able to efficiently read the parts of the document
clusters he or she wishes to read.
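
The role of the document-divided document map described above can be sketched as follows. This is a minimal illustration under assumptions: the functions build_map and document_results, the use of list indices as divided document identifiers, and the (document name, position) map entries are all hypothetical, and the classification of the divided documents is taken as already done.

```python
def build_map(documents, divide):
    """Divide every document and record where each divided document came from.

    documents: {document_name: text}; divide: function from text to list of parts.
    Returns (divided, doc_map) where doc_map[divided_id] = (document_name, position).
    """
    divided, doc_map = [], {}
    for name, text in documents.items():
        for position, part in enumerate(divide(text)):
            doc_map[len(divided)] = (name, position)
            divided.append(part)
    return divided, doc_map


def document_results(divided_categories, doc_map):
    """Turn divided-document results into document-level results.

    divided_categories: {divided_id: category}.
    Returns {category: {document_name: [positions of its divided documents]}}.
    """
    result = {}
    for divided_id, category in divided_categories.items():
        name, position = doc_map[divided_id]
        result.setdefault(category, {}).setdefault(name, []).append(position)
    return result
```

Because the map keeps the position of each divided document, the document-level result can point the reader to the relevant part of the original document.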

Further, the document classification device of the
present invention described above further comprises a document
save unit for saving the document data; a divided document save
unit for saving the divided document data; and a
document-divided document map save unit for saving a document-divided
document map created by the document-divided document map
creation unit. Consequently, for a single document data, it
is possible to efficiently determine classification results
having different parameters such as the number of
classifications, the classification method, and the settings
used in the classifications, without recreating the divided
document data and the document-divided document map.
Furthermore, by classifying the document data and saving the
data needed to create the classification result, the user is
free to take more time over the classification, and to
re-analyze previously classified documents within a given period
of time.

Further, the document classification device of the
present invention described above further comprises a divided
document classification result save unit for saving divided
document classification result information created by the
divided document classification result creation unit.
Consequently, in addition to the effects achieved by the
specific arrangement of the present invention described above,
after one classification has been carried out, the result of
that classification can be expressed in a variety of formats
such as text, charts, graphs, and the like. Furthermore, by
saving the divided document classification result information,
the user is free to take more time over classifications and
analysis of classification results, and to re-analyze
previously classified documents in a variety of formats within
a given period of time.

Further, the multiple divided document data created by
the document dividing unit contains the document data in its
state prior to being divided. Consequently, in addition to a
classification structure of detailed document data, obtained
by classifying the divided document data, the user is able to
obtain a classification structure fusing a schematic macro
classification as a result of classifying the document data itself
prior to division.

Further, the document dividing unit divides document data
based on information relating to the structure of the document
data. Consequently, division and the like of different topics
can be carried out, whereby documents can be classified in such
a manner that the detailed classification structures of their
document data can be known.

Further, the document classification device further
comprises a document element extraction unit for extracting
elements in the document data; and an element-accompanying
information extraction unit for extracting
element-accompanying information accompanying the elements extracted
by the document element extraction unit. Furthermore, the
document dividing unit divides the document data using elements
extracted by the document element extraction unit, or the
elements and element-accompanying information extracted by the
element-accompanying information extraction unit.
Consequently, documents can be classified so that the detailed
classification structure of the document data can be known.



Further, the document dividing unit divides document data 
in compliance with a specified specification range. 
Consequently, documents can be classified in accordance with 
the wishes of the user, and so that the detailed classification 
structure of the document data can be known. 

Further, the document dividing unit divides document data
based on the number of letters, the number of sentences, or both
the number of letters and the number of sentences. Consequently,
there is an increased capability to classify different
documents having contents of different topics and the like.
Therefore, as above, documents can be classified so that the
detailed classification structure of the document data can be
known.

Further, the document classification result creation 
unit extracts and presents information showing document data, 
and representative information accompanying the document data, 
as classification result information. Consequently, the user 
is able to determine a detailed schematic structure or overall 
structure of the document data. 

Further, the document classification result creation
unit extracts and presents information showing divided document
data, and representative information accompanying the divided
document data, as classification result information.
Consequently, the user is able to determine a detailed schematic
structure or overall structure of the document data. In
addition, the user can easily determine which divided document
has been classified in a given category.

According to still another aspect of this invention, the
document processing method of the present invention outputs
multiple input document data in order to display or print the
document data in a predetermined format, and comprises the steps
of storing input document data; selecting all or part of the
document data stored in the storing step; extracting data
relating to characteristics of letter rows from all or part of
the document data selected in the selection step;
work-processing all or part of the document data based on the data
relating to characteristics of letter rows extracted in the
characteristics extraction step; and outputting all or part of
the document data work-processed in the work processing step.
Consequently, when analyzing documents according to their
meanings, rather than merely outputting the result of the
analysis, the entire information analysis operation can be
supported.

Further, the step of outputting comprises the steps of
setting a plurality of item values based on the contents of all
or part of the document data work-processed in the
work processing step; and totalizing all or part of the document data
for each item value set in the item value setting step; and outputs
all or part of the document data in the format of a table having
an item value as at least one axis. Consequently, the result
of the work-processing can easily be expressed in a cross table,
and the contents of the information can easily be grasped.
Therefore, when analyzing documents according to their meanings,
rather than merely outputting the result of the analysis, the
entire information analysis operation can be supported.

Further, the step of outputting further comprises
outputting all or part of the document data work-processed in
the work processing step together with all or part of the
document data in its state prior to work-processing in the work
processing step. Consequently, data to be work-processed and
other data can be displayed simultaneously and identified,
whereby the range of the work-processing to be carried out can
be accurately and easily determined. Therefore, when
analyzing documents according to their meanings, rather than
merely outputting the result of the analysis, the entire
information analysis operation can be supported.

Further, the step of storing further comprises storing
all or part of the document data work-processed in the work
processing step. Consequently, since other data can be handled
simultaneously, when thereafter analyzing documents according
to their meanings, rather than merely outputting the result of
the analysis, the entire information analysis operation can be
supported.

Further, the step of selecting further comprises
selecting all or part of the document data output in the output
step. Consequently, since all or part of the document data
output in the output step can be selected for analysis, a wide
variety of information can be analyzed with high precision.
Therefore, when analyzing documents according to their meanings,
rather than merely outputting the result of the analysis, the
entire information analysis operation can be supported.

Further, the step of storing a document further comprises
storing data relating to contents of the work processing.
Consequently, not only can loss of data relating to the contents
of work-processing be prevented and the data managed easily,
but also the relationship between settings used in the
work-processing and the processed result can be determined.
Therefore, when analyzing documents according to their meanings,
rather than merely outputting the result of the analysis, the
entire information analysis operation can be supported.

According to still another aspect of this invention, the
document classification method of the present invention
comprises the steps of inputting document data;
language-analyzing document data input in the step of inputting and
obtaining language analysis information; creating document
characteristic vectors for the document data based on the
language analysis information obtained in the step of
language-analyzing; classifying documents based on the degree
of similarity between document characteristic vectors created
in the step of creating vectors, and creating clusters of
documents; calculating cluster characteristics, being
characteristics of clusters of documents created in the step
of classifying; and storing cluster characteristics,
calculated in the step of calculating cluster characteristics,
as constituent elements of classification categories.
Consequently, it is possible to obtain clusters, and to
structure and categorize the clusters based on their contents
using their degree of similarity to the cluster center, and the
like. Therefore, it is possible to gradually determine what
kind of contents are contained in a given document cluster.
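
The sequence of steps above can be sketched as follows. This is
a minimal illustration under stated assumptions, not the method
of the embodiment: whitespace tokenization stands in for language
analysis, term-frequency counts stand in for document
characteristic vectors, cosine similarity is assumed as the
degree of similarity, and a simple single-pass (leader)
clustering with a hypothetical threshold of 0.3 stands in for the
classification step. All function names and sample documents are
hypothetical.

```python
import math
from collections import Counter

def analyze(text):
    # Stand-in for language analysis: lowercase whitespace tokenization.
    return text.lower().split()

def make_vector(tokens):
    # Document characteristic vector as a sparse term-frequency mapping.
    return Counter(tokens)

def cosine(a, b):
    # Degree of similarity between two sparse vectors.
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def cluster(vectors, threshold=0.3):
    # Leader clustering: a document joins the first cluster whose summed
    # vector is similar enough, otherwise it starts a new cluster.
    clusters = []
    for i, v in enumerate(vectors):
        for c in clusters:
            if cosine(v, c["sum"]) >= threshold:
                c["members"].append(i)
                c["sum"].update(v)
                break
        else:
            clusters.append({"members": [i], "sum": Counter(v)})
    return clusters

def characteristics(c, top=3):
    # Cluster characteristics: the most frequent terms in the cluster.
    return [term for term, _ in c["sum"].most_common(top)]

docs = [
    "printer paper jam error",
    "paper jam in printer tray",
    "screen flicker on display",
    "display screen is dim",
]
vectors = [make_vector(analyze(d)) for d in docs]
clusters = cluster(vectors)

# Store cluster characteristics as constituent elements of categories.
categories = {f"category_{k}": characteristics(c) for k, c in enumerate(clusters)}
print([c["members"] for c in clusters])
print(categories)
```

Because cosine similarity ignores vector length, comparing a
document with the summed vector of a cluster is equivalent to
comparing it with the cluster center.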

According to still another aspect of this invention, the
document classification method of the present invention
comprises the steps of inputting document data;
language-analyzing document data input in the step of inputting and
obtaining language analysis information; creating document
characteristic vectors for the document data based on the
language analysis information obtained in the step of
language-analyzing; classifying documents based on the degree
of similarity between document characteristic vectors created
in the step of creating vectors, and creating clusters of
documents; calculating cluster characteristics, which are
characteristics of clusters of documents created in the step
of classifying; displaying the cluster characteristics
calculated in the step of calculating cluster characteristics;
selecting predetermined clusters from the clusters of documents
created in the step of classifying; and storing cluster
characteristics, calculated in the step of calculating cluster
characteristics, as constituent elements of classification
categories. Consequently, only selected clusters are used,
making it possible to structure and categorize the clusters in
a manner closer to that desired by the operator. Therefore,
it is possible to gradually determine what kind of contents are
contained in a given document cluster.

Further, the document classification method of the 
present invention described above further comprises a step of 
correcting document characteristic vectors stored in the step 
of storing document characteristic vectors, so that document 
characteristic vectors of documents belonging to clusters 
selected by the step of selecting clusters are deleted. 
Furthermore, the step of classifying comprises classifying 
documents based on the document characteristic vectors 
corrected by the step of correcting vectors. Consequently, the
effects of clusters which are already known can be eliminated, 
and new clusters can be created. Therefore, it is possible to 
gradually determine what kind of contents are contained in a 
given document cluster. 
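
The correction of stored vectors described above can be sketched
as follows, assuming the stored document characteristic vectors
are held in a mapping keyed by document identifier and each
cluster is a list of such identifiers. The function
`remove_selected` and the sample data are hypothetical.

```python
def remove_selected(vectors, clusters, selected_ids):
    """Return the vectors of documents that belong to no selected cluster."""
    known = {doc for cid in selected_ids for doc in clusters[cid]}
    return {doc: vec for doc, vec in vectors.items() if doc not in known}

# Hypothetical stored vectors and a previous classification result.
vectors = {
    "d1": {"printer": 1, "jam": 1},
    "d2": {"printer": 2},
    "d3": {"screen": 1},
    "d4": {"battery": 1},
}
clusters = {0: ["d1", "d2"], 1: ["d3"], 2: ["d4"]}

# The operator selected cluster 0; only the remaining documents are
# passed to the next classification.
remaining = remove_selected(vectors, clusters, selected_ids=[0])
print(sorted(remaining))
```

Classifying only the remaining vectors keeps the already known
cluster from dominating the next result.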

Further, the document classification method of the
present invention described above further comprises a step of
correcting document expression space when determining the
degree of similarity between document characteristic vectors
stored in the step of storing document characteristic vectors,
based on a characteristics amount calculated from clusters
selected in the step of selecting clusters, and the step of
classifying comprises classifying documents based on the degree
of similarity between document characteristic vectors created
in the step of creating vectors, using the document expression
space corrected in the step of correcting the document
expression space. Consequently, cluster characteristics
selected by the operator in the previous classification can be
eliminated from the next classification, enabling new clusters
to be created. Therefore, it is possible to gradually determine
what kind of contents are contained in a given document cluster.
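
This passage does not state how the document expression space is
corrected. One possible reading, assumed here purely for
illustration, is to take the center of the selected cluster as the
characteristics amount and to remove that direction from every
document characteristic vector before computing similarity. The
functions `centroid`, `project_out`, and `cosine` and the sample
vectors are hypothetical.

```python
import math

def centroid(vectors):
    # Assumed characteristics amount: the mean vector of the cluster.
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def project_out(vector, direction):
    # Remove the component of `vector` that lies along `direction`.
    norm_sq = sum(d * d for d in direction)
    if norm_sq == 0:
        return list(vector)
    scale = sum(v * d for v, d in zip(vector, direction)) / norm_sq
    return [v - scale * d for v, d in zip(vector, direction)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical vectors of the cluster selected in the previous round.
selected_cluster = [[2, 0, 0, 0], [0, 2, 0, 0]]
known = centroid(selected_cluster)

# Two documents that resemble each other only through the known cluster.
doc_a = [1, 1, 1, 0]
doc_b = [1, 1, 0, 1]

before = cosine(doc_a, doc_b)
after = cosine(project_out(doc_a, known), project_out(doc_b, known))
print(before, after)
```

In this sketch the two documents look similar before the
correction and unrelated after it, because the only terms they
share are those of the already selected cluster.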

Further, the document classification method of the
present invention described above further comprises a step
of correcting document expression space when determining the
degree of similarity between document characteristic vectors
stored in the step of storing document characteristic vectors,
based on a characteristics amount calculated from clusters
selected in the step of selecting clusters. Furthermore, the
step of classifying comprises classifying documents based on
the degree of similarity between document characteristic
vectors created in the step of creating vectors, using the
document expression space corrected in the step of correcting
the document expression space. Consequently, influences of
the known cluster can be eliminated and cluster characteristics
selected by the operator in the previous classification can be
eliminated from the next classification, enabling new clusters
to be created. Therefore, it is possible to gradually determine
what kind of contents are contained in a given document cluster.

Further, the document classification method of the
present invention described above further comprises a step
of appending selection information showing the fact of
selection when all or part of the documents belonging to a
cluster of documents created in the step of classifying have
been selected. Furthermore, the step of displaying comprises
displaying the cluster characteristics, and displaying the
selection information appended in the step of appending
selection information. Consequently, it is possible to
improve the ability to identify documents used on multiple
occasions, and the ability to identify documents which have not
been selected at all. Therefore, it is possible to gradually
determine what kind of contents are contained in a given
document cluster.
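
A minimal sketch of appending and displaying selection
information follows, assuming the selection information is simply
a per-document count of how often the document belonged to a
selected cluster. The functions `append_selection` and
`selection_report` and the sample data are hypothetical.

```python
from collections import Counter

def append_selection(selection_counts, cluster_members, selected_cluster_ids):
    # Record the fact of selection for every document in a selected cluster.
    for cid in selected_cluster_ids:
        selection_counts.update(cluster_members[cid])
    return selection_counts

def selection_report(all_doc_ids, selection_counts):
    # Selection information to display next to the cluster characteristics.
    return {doc: selection_counts[doc] for doc in all_doc_ids}

docs = ["d1", "d2", "d3"]
counts = Counter()

# First classification round: the operator selects cluster 0.
append_selection(counts, {0: ["d1", "d2"], 1: ["d3"]}, [0])
# Second round with different clusters: the operator again selects cluster 0.
append_selection(counts, {0: ["d1"], 1: ["d2", "d3"]}, [0])

report = selection_report(docs, counts)
never_selected = [doc for doc, n in report.items() if n == 0]
print(report, never_selected)
```

A count above one marks a document used on multiple occasions, and
a count of zero marks a document that has not been selected at all.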

Further, the step of creating classification categories
comprises creating cluster characteristics and/or information
created by an operator, in addition to all or part of the
documents belonging to a cluster of documents selected in the
step of specifying selection, as constituent elements of
classification categories. Consequently, the contents of
clusters can be easily recognized, and in addition, the operator
can easily create his own classification categories, thereby
improving the usefulness of the classification categories.
Therefore, it is possible to gradually determine what kind of
contents are contained in a given document cluster.

According to still another aspect of this invention, the
document classification method according to the present
invention comprises the steps of inputting document data
groups; dividing document data into one or multiple divided
document data based on a predetermined reference; creating a
map showing the correspondence between the document data and
the divided document data; classifying the divided document
data; creating divided document classification result
information based on the classification result of classifying
the divided documents; and creating classification result
information of the document data using the document-divided
document map and the divided document classification result
information. Consequently, when one document contains
multiple topics and meanings, these can be classified into
categories according to specific topics and meanings, so that
the classifications do not differ from categories desired by
a user, thereby enabling the user to easily comprehend the
classification categories. Furthermore, since the positions
of the divided documents in documents prior to division
(documents belonging to the clusters) are displayed, the user
is able to efficiently read the parts of the document clusters
he or she wishes to read.
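
The division, mapping, and aggregation steps above can be sketched
as follows. The sketch assumes that the predetermined reference is
a blank line between paragraphs and uses a keyword lookup as a
stand-in for the classification of divided documents; both
choices, along with every function name and sample document, are
hypothetical.

```python
def divide(documents):
    # Assumed reference: split each document at blank lines (paragraphs).
    divided, doc_map = [], []
    for doc_id, text in documents.items():
        parts = [p.strip() for p in text.split("\n\n") if p.strip()]
        for position, part in enumerate(parts):
            divided.append(part)
            # Map entry: source document and position of the divided part.
            doc_map.append((doc_id, position))
    return divided, doc_map

def classify(part, keywords):
    # Stand-in classifier: first category whose keyword occurs in the part.
    lowered = part.lower()
    for category, words in keywords.items():
        if any(word in lowered for word in words):
            return category
    return "other"

def classify_documents(documents, keywords):
    divided, doc_map = divide(documents)
    result = {}
    # Combine the map with the per-part results to get per-document results.
    for part, (doc_id, position) in zip(divided, doc_map):
        result.setdefault(doc_id, []).append((position, classify(part, keywords)))
    return result

documents = {
    "doc1": "The printer jams often.\n\nThe screen is too dim.",
    "doc2": "The screen flickers.",
}
keywords = {"printing": ["printer"], "display": ["screen"]}

result = classify_documents(documents, keywords)
print(result)
```

Keeping the position of each divided part in the map is what lets
a display point the user to the relevant part of the original
document.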

According to still another aspect of this invention, a 
computer-readable recording medium of the present invention 
stores programs for executing the above-described document 
classification method on a computer, thereby making the program 
readable mechanically, and enabling the operation of the 
document classification method to be executed by a computer. 

Although the invention has been described with respect 
to a specific embodiment for a complete and clear disclosure, 
the appended claims are not to be thus limited but are to be 
construed as embodying all modifications and alternative 
constructions that may occur to one skilled in the art which 
fairly fall within the basic teaching herein set forth. 